Evaluation Framework
Evaluating Complex Agent Systems at Scale
Constructing evaluations for a dynamically orchestrated, general-purpose agent is one of the hardest challenges in production AI systems. How do you test an agent that can generate arbitrary pipelines, access vast amounts of data, and take complex actions?
Our solution: build a comprehensive sandbox environment that mirrors production, run hundreds of realistic scenarios, and verify the agent produces reasonable results. Here's how we built an evaluation framework that gives us confidence in our agent's performance.
The Challenge: Testing Dynamic Intelligence
The Rox Agent operates over:
Private and public data from sellers' accounts
Meeting transcripts, emails, notes, and artifacts
Tools to create and manage entities in Rox
Dynamic pipeline generation based on user queries
Traditional testing approaches fail here. Unit tests can't capture emergent behavior. Integration tests struggle with the combinatorial explosion of possible agent actions. We needed something more sophisticated.
Building the Sandbox: Production Fidelity at Test Speed
Service Layer Abstraction
Rox's platform is built with strict service abstractions—all entity access and queries flow through well-defined service interfaces. This architectural decision pays massive dividends for testing.
Our sandbox environment works by:
Mocking each service with hardcoded production-like data
Replacing the Agent Context Service (responsible for reading/writing agent messages and ephemeral UI state) with a controlled mock
Maintaining strict interface contracts between the agent brain and data layer
This separation is critical. The agent brain interacts with data only through specified interfaces, whether that's:
Knowledge graph data via services
Conversation layer for message handling
Tool execution for actions
Because these interfaces are strictly defined, setting up faithful mocks is straightforward, and we have high confidence that the sandbox reflects production behavior.
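As a minimal sketch of what this looks like in practice, the snippet below shows a mock that satisfies the same interface contract as its production counterpart. The names here (`AccountService`, `MockAccountService`, the fixture fields, `build_agent`) are illustrative placeholders, not our actual service definitions.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Account:
    account_id: str
    name: str
    notes: list[str]


class AccountService(Protocol):
    """The interface contract the agent brain codes against."""

    def get_account(self, account_id: str) -> Account: ...
    def search_notes(self, account_id: str, query: str) -> list[str]: ...


class MockAccountService:
    """Sandbox implementation backed by hardcoded, production-like fixtures."""

    def __init__(self, fixtures: dict[str, Account]) -> None:
        self._fixtures = fixtures

    def get_account(self, account_id: str) -> Account:
        return self._fixtures[account_id]

    def search_notes(self, account_id: str, query: str) -> list[str]:
        return [n for n in self._fixtures[account_id].notes if query.lower() in n.lower()]


def build_agent(account_service: AccountService):
    """The agent brain only sees the interface, so the sandbox can inject
    a mock without any change to agent code."""
    ...
```

In production the same entry point would receive the real service implementation; in the sandbox it receives the mock seeded with controlled data.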
The Power of Architectural Boundaries
Our agent architecture's clean separation between brain and data layers isn't just good design—it's what makes comprehensive testing possible. By enforcing strict boundaries:
Mocks can accurately simulate production services
Tests remain stable despite implementation changes
Edge cases are reproducible and debuggable
The Evaluation Framework: Testing at Scale
Core Evaluation Pipeline
We run approximately 200 common queries through our sandbox environment, checking for reasonable outcomes. Since we control the data, we can:
Use heuristics to verify agent performance for straightforward cases
Deploy LLM-as-judge for more nuanced evaluations
Track exact tool calls and pipeline generation patterns
Each evaluation scenario includes:
Input query
Expected data access patterns
Success criteria (heuristic or LLM-judged)
Performance benchmarks
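A scenario, then, is just data plus a success check. The sketch below shows one way such a scenario and its hybrid evaluation could be represented; the field names, the `llm_judge` stub, and the thresholds are assumptions for illustration rather than our actual schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def llm_judge(rubric: str, response: str) -> bool:
    """Placeholder: a real implementation would prompt a judge model with the rubric."""
    raise NotImplementedError


@dataclass
class EvalScenario:
    query: str                                         # input query
    expected_tools: list[str]                          # expected data access / tool calls
    heuristic: Optional[Callable[[str], bool]] = None  # cheap deterministic check
    judge_rubric: Optional[str] = None                 # used when heuristics aren't enough
    max_latency_s: float = 30.0                        # performance benchmark


def evaluate(scenario: EvalScenario, response: str, tool_calls: list[str], latency_s: float) -> dict:
    """Hybrid check: heuristics for straightforward cases, LLM-as-judge for nuanced ones."""
    results = {
        "tools_ok": all(tool in tool_calls for tool in scenario.expected_tools),
        "latency_ok": latency_s <= scenario.max_latency_s,
    }
    if scenario.heuristic is not None:
        results["answer_ok"] = scenario.heuristic(response)
    elif scenario.judge_rubric is not None:
        results["answer_ok"] = llm_judge(scenario.judge_rubric, response)
    return results
```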
Continuous Evaluation Development
Our eval suite grows alongside our features:
Feature Development — New capabilities come with new eval scenarios
Regression Tracking — Key workflows are monitored across all changes
Edge Case Integration — Production issues become new test cases
This tight feedback loop ensures that as the agent becomes more capable, our confidence in its behavior grows proportionally.
Learning from Production: The Feedback Loop
From Edge Cases to Test Cases
When we encounter unexpected behavior in production, our process is:
Identify where sandbox datasets failed to capture the real-world pattern
Update sandbox data to reflect these patterns
Add specific eval scenarios for the edge case
Verify performance improvements
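For example, a production miss might become a permanent scenario like the one below, which reuses the hypothetical `EvalScenario` shape from the earlier sketch; the query, tools, and heuristic are invented for illustration.

```python
# Hypothetical example: the agent once missed a renewal date buried in a
# meeting transcript, so the pattern becomes a permanent regression scenario.
regression_suite: list[EvalScenario] = []

renewal_edge_case = EvalScenario(
    query="When is the Acme Corp contract up for renewal?",
    expected_tools=["search_meeting_transcripts", "get_account"],
    heuristic=lambda response: "march" in response.lower(),
)

regression_suite.append(renewal_edge_case)
```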
This continuous refinement means our sandbox becomes increasingly representative of production complexity over time.
Layered Testing Strategy
Beyond our agent-level evaluations, we maintain:
Unit tests for all tools and context layers
Interface tests ensuring mocks accurately reflect real services
Performance benchmarks tracking latency and resource usage
This isn't revolutionary—it's standard engineering practice. But combined with our sandbox evaluations, it creates a comprehensive testing pyramid that catches issues at the appropriate level.
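One simple way to keep the "mocks reflect real services" guarantee enforceable is a contract test that runs the same assertions against every implementation of an interface. The pytest-style sketch below builds on the hypothetical `AccountService` example from earlier; the production-backed entry is commented out because it would need real credentials, and all data shown is illustrative.

```python
import pytest

SANDBOX_FIXTURES = {
    "acct-123": Account(
        account_id="acct-123",
        name="Acme Corp",
        notes=["Renewal discussed for Q2", "Pricing objection on seat count"],
    ),
}

# Every implementation registered here must pass the same contract tests.
IMPLEMENTATIONS = {
    "mock": lambda: MockAccountService(fixtures=SANDBOX_FIXTURES),
    # "real": lambda: ProductionAccountService(...),  # wired in only where real credentials exist
}


@pytest.fixture(params=sorted(IMPLEMENTATIONS))
def account_service(request):
    return IMPLEMENTATIONS[request.param]()


def test_search_notes_returns_only_matching_notes(account_service):
    notes = account_service.search_notes("acct-123", query="renewal")
    assert all("renewal" in note.lower() for note in notes)
```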
Key Insights
Building evals for complex agent systems taught us several critical lessons:
Architecture Enables Testing — Clean service boundaries and interface definitions make comprehensive mocking possible
Control the Data, Control the Test — Owning the sandbox data lets us create deterministic, reproducible scenarios
Hybrid Evaluation Works — Combining heuristics for simple cases with LLM judges for complex ones balances accuracy and speed
Production Teaches Best — Edge cases from real usage are gold for improving evaluation coverage
The Result: Confidence at Scale
This evaluation framework gives us the confidence to:
Ship agent improvements rapidly
Catch regressions before they reach production
Understand exactly how changes affect agent behavior
Scale our agent's capabilities without sacrificing reliability
By treating evaluation as a first-class engineering challenge—not an afterthought—we've built a system that lets us innovate on agent capabilities while maintaining production stability.