Agents
How we build agents at Rox
An agent retrieves three transcripts from the wrong account. It reasons over them, produces a confident brief on a customer relationship it has no real data for. Nothing fails. No error surfaces. The output reads well. A human catches it the next day, or doesn't.
This is the core problem in building agents outside of code. A coding agent can run what it writes, generate tests on the fly, check the output. It creates its own verification as it goes. Our agents can't. If retrieval goes wrong, the system doesn't know.
This post shows what we build instead, and where we're going. We've written about the AgentDataInterface and how we structure agent access to our Knowledge Graph. This covers the layer above that: how the agents work and why they're built the way they are.
The progression:
The verification loop that makes coding agents work doesn't exist in our domain.
What we build instead.
How we're building the loop.
No cheap verification
Coding agents are in a flywheel right now, and the reason is structural. A coding agent generates code, runs it, sees what broke, tries again. It can write its own tests, invoke linters, check types. The verification step is classical, deterministic, and nearly free. This is why dynamic context discovery works in code: if the agent grabs the wrong files, the feedback is immediate. Tests fail. The compiler complains. The agent retries with different context. Being wrong is cheap.
Our agents don't get that signal. They reason over meeting transcripts, email threads, CRM records, and web research. If the agent pulls the wrong three transcripts out of 150 to answer a question, nothing fails. There is no linter. The agent generates a confident, well-structured brief from the wrong context, and no signal tells it anything went wrong. A human eventually reads it and notices. Or doesn't.
The data makes this worse. New transcripts and emails arrive constantly, so pre-computed indexes go stale. Two customers in the same industry produce transcripts that look nearly identical to an embedding model, but the agent needs the right customer's data, not similar data. Metadata filtering that maps queries to the correct account and contact is a strict prerequisite before semantic retrieval can even start. And the signal that matters is often sparse: a pricing objection buried in one sentence of a 45-minute transcript is exactly what the agent needs to surface. Compression loses it. Exploratory retrieval buries it in noise.
To verify whether the agent retrieved the right context, you'd need to run an equivalently expensive retrieval process. Or a more expensive one. The verification step IS deep research. You can't build a cheap inner loop around that.
This single constraint shaped every decision described below.
What we build instead
Since we can't verify retrieval cheaply at runtime, we invest engineering effort into the layers where we can directly control quality: the data layer (what goes into the agent's context) and the action layer (what comes out).
The data layer: scored retrieval
We can't afford exploratory retrieval. We get one shot at putting the right context in front of the agent. So we spend LLM compute at the retrieval layer, before anything enters the agent's context window.
The key difference from exploratory search is that the agent explicitly declares what it needs. Instead of issuing broad grep queries and hoping relevant chunks surface, the agent issues a natural language query expressing its specific information need, along with metadata filters scoped to the right user, account, and contacts. Candidates come back through the AgentDataInterface. Those candidates fan out to an LLM-based parallel reranker that scores each one against that specific intent. A curated top-k, sized to the agent's remaining context budget, is what the agent actually sees.

One reranker pass over 150 transcript chunks is cheaper than an agent grinding through them with autocompact and grep. And the curated context is higher quality: the reranker evaluates every candidate against a focused retrieval objective, rather than the agent trying to do retrieval and reasoning simultaneously with no separation of concerns.
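The shape of that pipeline can be sketched as follows. This is a hedged illustration with a pluggable `score_fn` standing in for the LLM reranker call; the function names and the thread-pool fan-out are assumptions, not Rox's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def rerank(query: str, candidates: list[str], score_fn, top_k: int) -> list[str]:
    """Score every candidate against the agent's declared information need
    in parallel, then return a curated top-k sized to the context budget."""
    with ThreadPoolExecutor() as pool:
        # One scoring call per candidate, all against the same focused query.
        scores = list(pool.map(lambda c: score_fn(query, c), candidates))
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```

The separation of concerns lives in `score_fn`: it evaluates one candidate against one retrieval objective, nothing else, which is what lets the pass stay cheap and parallel.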
The same architecture applies to web research. Wide SERP retrieval, controlled extraction from top results, reranking before anything reaches the agent. The pipeline differs (web data is noisier, you don't control sources, extract quality varies) but the reason for the architecture is the same: no verification loop means retrieval has to be right on the first pass.
The reranker started as a solution for transcript retrieval and generalized into the pattern now used across the board. Brian covers the architecture in his post. For web research, evaluating retrieval quality itself requires running expensive deep research passes and comparing, which is the subject of Mehul and Sanchit's post. Pranav covers the infrastructure for running web research at scale in his post.
The action layer: structured output mid-reasoning
On the output side, our agents produce typed objects through tool calls as they reason, not as a post-processing step at the end. An email composition is a tool call with typed fields that the frontend renders directly as an editable draft. A contact list is a typed list[Person] that the frontend renders as interactive cards and that workflows can consume downstream.
Forcing structured output extraction at the end of an agent run confuses the model. Having the agent emit typed tool calls as it reasons just works, and lets the agent produce multiple structured artifacts per run. The typed objects are portable: they plug directly into Rox's workflow automation platform, where they can trigger downstream actions. Taeuk covers that platform in his post.
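As a minimal sketch of what a typed tool-call result looks like on the receiving end: a `contact_list` handler validates the call's arguments into `Person` objects the frontend and workflows can consume directly. The handler name and field set are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    email: str
    title: str

def handle_contact_list(args: dict) -> list[Person]:
    """Validate a hypothetical contact_list tool call into typed objects.
    A malformed payload fails loudly here, not downstream in a workflow."""
    return [Person(**p) for p in args["contacts"]]
```

Because the object is typed at the moment the agent emits it, the same `list[Person]` can render as interactive cards and feed workflow automation without a fragile end-of-run extraction step.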
Orchestration: simplified
We used to have specialized sub-agents for research, artifact generation, and analysis, with a routing layer that decomposed queries. It worked but was brittle. Every new capability required new routing logic and new edge cases.
Frontier models made most of that unnecessary. The industry has converged on good patterns for planning and context management, and frontier models handle them well. We use similar approaches. The engineering that matters is in the data and action layers, not the harness.
Building the loop
The scored retrieval architecture is what got us from handling a few customer relationships per session to over a hundred, pulling from hundreds of transcripts and email threads per run. But it's a ceiling, not a flywheel. We can engineer quality into each layer, but the system doesn't improve itself from its own outputs.
We have access to a feedback signal that coding agents will never have: real-world business outcomes. Deals close or they don't. Outreach sequences get responses or they don't. The problem is these signals are delayed by weeks or months, noisy, and hard to attribute back to specific agent actions.
Near-term, we're building offline evaluation agents that start to close this gap: deep research runs that crawl outcome data, comparing what our agents recommended against what actually happened. Too expensive for runtime, but as a nightly improvement signal, they're working. We're pointing our own deep research capabilities at ourselves.
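The core of that nightly comparison is a join of recommendations against later outcomes. A simplified sketch, with hypothetical names; real attribution across weeks of delay and noise is much harder, this only shows the shape of the signal.

```python
def agreement_rate(recommendations: dict[str, str],
                   outcomes: dict[str, str]) -> float:
    """Fraction of deals where the agent's recommended action matched what
    actually happened. Deals without outcome data yet are skipped, which
    is what makes this an offline, delayed signal rather than a runtime one."""
    scored = [(deal, action) for deal, action in recommendations.items()
              if deal in outcomes]
    if not scored:
        return 0.0
    hits = sum(1 for deal, action in scored if outcomes[deal] == action)
    return hits / len(scored)
```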
Longer-term, we're building agents that participate directly in execution with clear outcome signals. The agent recommends a strategy, participates in executing it, and the result feeds back into how we score retrieval and structure recommendations. A system of agents, skills, and scoring functions that tunes itself over time from outcome data.

Coding agents got their verification loop for free. We have to construct ours from scratch, out of delayed and noisy outcome data, across systems that weren't designed to produce training signal.
If these are problems you want to work on: retrieval without clean verification signals, evals without ground-truth labels, building feedback loops from delayed real-world outcomes, agent systems over large-scale heterogeneous unstructured data where wrong answers have real consequences. We're hiring.
Deep dives from the team:
Brian: The reranker architecture
Mehul and Sanchit: Agentic web research and evaluation
Taeuk: Workflow automation at Rox