Evaluation Framework

Evaluating Complex Agent Systems at Scale

Constructing evaluations for a dynamically orchestrated, general-purpose agent is one of the hardest challenges in production AI systems. How do you test an agent that can generate arbitrary pipelines, access vast amounts of data, and take complex actions?

Our solution: build a comprehensive sandbox environment that mirrors production, run hundreds of realistic scenarios, and verify the agent produces reasonable results. Here's how we built an evaluation framework that gives us confidence in our agent's performance.

The Challenge: Testing Dynamic Intelligence

The Rox Agent operates over:

  • Private and public data from sellers' accounts

  • Meeting transcripts, emails, notes, and artifacts

  • Tools to create and manage entities in Rox

  • Dynamic pipeline generation based on user queries

Traditional testing approaches fail here. Unit tests can't capture emergent behavior. Integration tests struggle with the combinatorial explosion of possible agent actions. We needed something more sophisticated.

Building the Sandbox: Production Fidelity at Test Speed

Service Layer Abstraction

Rox's platform is built with strict service abstractions—all entity access and queries flow through well-defined service interfaces. This architectural decision pays massive dividends for testing.
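As a rough illustration of what such a contract can look like (the names below are hypothetical, not Rox's actual interfaces), the agent brain codes against an entity service defined purely as an interface:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Account:
    """Minimal stand-in for a knowledge-graph entity."""
    id: str
    name: str
    owner_email: str


class AccountService(Protocol):
    """The contract the agent brain codes against; it never touches storage directly."""

    def get_account(self, account_id: str) -> Account:
        ...

    def search_accounts(self, query: str, limit: int = 10) -> list[Account]:
        ...
```

Both the production implementation and the sandbox mock satisfy the same contract, so the agent brain never needs to know which one it is talking to.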

Our sandbox environment works by:

  1. Mocking each service with hardcoded production-like data

  2. Replacing the Agent Context Service (responsible for reading/writing agent messages and ephemeral UI state) with a controlled mock

  3. Maintaining strict interface contracts between the agent brain and data layer

This separation is critical. The agent brain interacts with data only through specified interfaces, whether that's:

  • Knowledge graph data via services

  • Conversation layer for message handling

  • Tool execution for actions

Because these interfaces are strictly defined, setting up accurate mocks is straightforward, and we have high confidence that our sandbox accurately reflects production behavior.
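Continuing the hypothetical AccountService sketch from above, a sandbox mock is simply another implementation of the same contract, backed by hardcoded, production-like fixtures (the data here is invented for illustration):

```python
class MockAccountService:
    """Sandbox implementation of the AccountService contract, backed by fixtures we control."""

    def __init__(self) -> None:
        # Hardcoded, production-like data; deterministic across eval runs.
        self._accounts = {
            "acct-001": Account(id="acct-001", name="Acme Corp", owner_email="seller@example.com"),
            "acct-002": Account(id="acct-002", name="Globex", owner_email="seller@example.com"),
        }

    def get_account(self, account_id: str) -> Account:
        return self._accounts[account_id]

    def search_accounts(self, query: str, limit: int = 10) -> list[Account]:
        matches = [a for a in self._accounts.values() if query.lower() in a.name.lower()]
        return matches[:limit]
```

Swapping the mock in for the real service is then just dependency injection at construction time, which is what keeps the sandbox faithful to production behavior.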

The Power of Architectural Boundaries

Our agent architecture's clean separation between brain and data layers isn't just good design—it's what makes comprehensive testing possible. By enforcing strict boundaries:

  • Mocks can accurately simulate production services

  • Tests remain stable despite implementation changes

  • Edge cases are reproducible and debuggable

The Evaluation Framework: Testing at Scale

Core Evaluation Pipeline

We run approximately 200 common queries through our sandbox environment, checking for reasonable outcomes. Since we control the data, we can:

  • Use heuristics to verify agent performance for straightforward cases

  • Deploy LLM-as-judge for more nuanced evaluations

  • Track exact tool calls and pipeline generation patterns

Each evaluation scenario includes:

  • Input query

  • Expected data access patterns

  • Success criteria (heuristic or LLM-judged)

  • Performance benchmarks
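To make that concrete, a scenario in a suite like this can be represented as a small record that carries its own grading strategy. The sketch below is illustrative only; the field names and the injected judge hook are assumptions rather than our production schema:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalScenario:
    """One sandbox evaluation case."""
    query: str                                         # input query fed to the agent
    expected_tools: list[str]                          # data-access / tool-call patterns we expect
    heuristic: Optional[Callable[[str], bool]] = None  # cheap check for straightforward cases
    judge_rubric: Optional[str] = None                 # rubric for LLM-as-judge on nuanced cases
    max_latency_s: float = 30.0                        # performance budget


def grade(
    scenario: EvalScenario,
    answer: str,
    tool_calls: list[str],
    judge: Callable[[str, str], bool],                 # (answer, rubric) -> verdict; an LLM call in practice
) -> bool:
    """Hybrid grading: heuristics where possible, LLM-as-judge otherwise."""
    if set(scenario.expected_tools) - set(tool_calls):
        return False                                   # the agent skipped an expected data access
    if scenario.heuristic is not None:
        return scenario.heuristic(answer)
    if scenario.judge_rubric is not None:
        return judge(answer, scenario.judge_rubric)
    return True
```

Keeping the judge injectable means the grading logic itself stays trivially unit-testable, while the LLM call lives behind its own interface like every other service.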

Continuous Evaluation Development

Our eval suite grows alongside our features:

  1. Feature Development — New capabilities come with new eval scenarios

  2. Regression Tracking — Key workflows are monitored across all changes

  3. Edge Case Integration — Production issues become new test cases

This tight feedback loop ensures that as the agent becomes more capable, our confidence in its behavior grows proportionally.

Learning from Production: The Feedback Loop

From Edge Cases to Test Cases

When we encounter unexpected behavior in production, our process is:

  1. Identify where sandbox datasets failed to capture the real-world pattern

  2. Update sandbox data to reflect these patterns

  3. Add specific eval scenarios for the edge case

  4. Verify performance improvements

This continuous refinement means our sandbox becomes increasingly representative of production complexity over time.
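Reusing the EvalScenario sketch from earlier, step 3 in the list above can be as small as appending a deterministic scenario that encodes the production edge case. The query, tool names, and heuristic below are invented for illustration:

```python
# The ~200-scenario suite, elided here.
eval_suite: list[EvalScenario] = []

# A production edge case captured as a permanent regression scenario (invented example).
eval_suite.append(
    EvalScenario(
        query="Summarize every meeting with Acme Corp from last quarter",
        expected_tools=["search_accounts", "fetch_meeting_transcripts"],
        # Heuristic: the summary must at least name the account we asked about.
        heuristic=lambda answer: "Acme" in answer,
    )
)
```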

Layered Testing Strategy

Beyond our agent-level evaluations, we maintain:

  • Unit tests for all tools and context layers

  • Interface tests ensuring mocks accurately reflect real services (a sketch follows below)

  • Performance benchmarks tracking latency and resource usage

This isn't revolutionary—it's standard engineering practice. But combined with our sandbox evaluations, it creates a comprehensive testing pyramid that catches issues at the appropriate level.
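The interface tests in particular follow the familiar contract-testing pattern: the same assertions run against both the mock and the real client, so the two cannot silently drift apart. Here is a minimal pytest-style sketch, again reusing the hypothetical AccountService and assuming some factory for the real client:

```python
import pytest


@pytest.fixture(params=["mock", "real"])
def account_service(request):
    """Yields each implementation of the AccountService contract in turn."""
    if request.param == "mock":
        return MockAccountService()
    return build_real_account_service()  # hypothetical factory for the production client


def test_search_respects_limit(account_service):
    # The same behavioral expectation must hold for every implementation.
    results = account_service.search_accounts("corp", limit=1)
    assert len(results) <= 1
```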

Key Insights

Building evals for complex agent systems taught us several critical lessons:

  1. Architecture Enables Testing — Clean service boundaries and interface definitions make comprehensive mocking possible

  2. Control the Data, Control the Test — Owning the sandbox data lets us create deterministic, reproducible scenarios

  3. Hybrid Evaluation Works — Combining heuristics for simple cases with LLM judges for complex ones balances accuracy and speed

  4. Production Teaches Best — Edge cases from real usage are gold for improving evaluation coverage

The Result: Confidence at Scale

This evaluation framework gives us the confidence to:

  • Ship agent improvements rapidly

  • Catch regressions before they reach production

  • Understand exactly how changes affect agent behavior

  • Scale our agent's capabilities without sacrificing reliability

By treating evaluation as a first-class engineering challenge—not an afterthought—we've built a system that lets us innovate on agent capabilities while maintaining production stability.
