Smart Email Rule Engine
Not every email belongs in your context graph.
Your sales rep just closed a Rippling deal. The emails with their procurement team, the pricing threads, infosec review, the back-and-forth on contract terms. That's exactly the kind of context your revenue system should capture.
That same week, Rippling sent every employee at your company a pay stub. Same domain. Same rippling.com in the sender field. Completely different data.
A domain-based filter can't tell these apart. Block rippling.com and you lose the deal context. Allow it and you're ingesting compensation data. This isn't an edge case. It's the simplest example of the state of enterprise email when you try to build intelligence on top of it.
Basic systems that assign emails to accounts by domain expose just how much noise sits in enterprise mailboxes. The long tail of "looks legitimate but shouldn't be indexed" is enormous, and it shifts across organizations and industries.
This is why we built a multi-stage ingestion pipeline that combines deterministic rules with LLM-based classification. Every stage evaluates emails against shared defaults layered with org-specific rules, giving each customer full control over what enters their system.
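To make the layering idea concrete, here is a minimal sketch of how shared defaults and org-specific rules could combine. Everything in it is an assumption for illustration: `RuleSet`, `layer`, the `allow_domains` override, and the example domains are hypothetical and not the actual Rox schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleSet:
    """Hypothetical rule container; the real schema is not described in this post."""
    deny_domains: frozenset = frozenset()
    deny_keywords: frozenset = frozenset()
    allow_domains: frozenset = frozenset()  # org-level exceptions to the defaults


def layer(defaults: RuleSet, org: RuleSet) -> RuleSet:
    """Combine shared defaults with org rules; org allow entries win over any deny."""
    deny_domains = (defaults.deny_domains | org.deny_domains) - org.allow_domains
    return RuleSet(
        deny_domains=deny_domains,
        deny_keywords=defaults.deny_keywords | org.deny_keywords,
        allow_domains=org.allow_domains,
    )


# Illustrative shared defaults (assumed values).
SHARED_DEFAULTS = RuleSet(
    deny_domains=frozenset({"sendgrid.net", "bulkmail.example"}),
    deny_keywords=frozenset({"unsubscribe"}),
)

# An org that sells to bulkmail.example keeps that domain and adds its own keyword.
org_rules = RuleSet(
    deny_keywords=frozenset({"pay stub"}),
    allow_domains=frozenset({"bulkmail.example"}),
)

effective = layer(SHARED_DEFAULTS, org_rules)
```

In this sketch the org can both tighten the defaults (an extra keyword) and loosen them (an allowed domain), which is one way to read "full control" over what enters the system.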
This email ingestion engine fits into a set of rule engines that process all incoming data, only persisting relevant, anonymized, and policy-compliant information while filtering out noise, sensitive leakage, and content that doesn't belong.

Figure 1 The full smart rule engine that gates which data gets persisted in the knowledge graph and the Rox System
Four stories that break simple filters
Customer complaints about your rep
Your champion at a major account emails your VP of Sales: "We need to seriously talk about how Jeff handled the renewal conversation." That email is about the account, involves contacts your system tracks, and references a deal. By every signal a naive pipeline looks for, this is exactly the kind of email it should ingest.
But it's a personnel complaint. If it lands in the context graph, it shows up in meeting briefs, relationship summaries, agent-generated account research. Now every rep who touches that account can read that a customer complained about their colleague. This email is HR content wearing a sales costume.
The M&A thread
Your CEO is in early acquisition talks with a company that happens to be one of your biggest customers. The emails go back and forth on a domain your sales team actively works. We risk acquisition financials and term sheets ending up in the same context graph your AE consults before their next QBR. That's not just a data quality issue; it's a securities issue.
Legal hold on a customer account
Outside counsel emails your GC about a contract dispute with a customer. The domain matches an active account. The thread references specific deal terms, liability exposure, settlement figures. Attorney-client privilege doesn't survive being indexed into a sales intelligence platform.
Your legal team would rightfully lose their minds if this showed up in agent outputs. But to a naive ingestion pipeline, it looks like a thread involving a known account with deal-relevant language. Exactly the kind of content it's designed to capture.
The misleading sender domain
A prospect with an active opportunity has its marketing team send you a campaign routed through SendGrid. The domain in the headers reads sendgrid.net, not the prospect's actual domain. Your pipeline maps the company as SendGrid. Wrong account attribution. Wrong ingestion decision. And that error cascades into everything downstream: relationship intelligence, meeting briefs, agent workflows. All pointed at the wrong company.
Every one of these breaks a simple rule-based system. Deny lists and keyword filters don't work when you want to continue to track certain domains and the thread uses deal-related language.
The pipeline
The ingestion pipeline is a hierarchical funnel. Each stage is more capable than the last, and each operates on the principle of data minimization, accessing data only on a need-to-know basis.
Deterministic filters, including keyword filters and denylists, eliminate the obvious violators found in everyone's inbox. The SendGrid email thread dies here because the organization blocks infrastructure domains. The customer complaint, the M&A thread, and the legal hold all sail through.
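A deterministic stage of this kind can be sketched in a few lines. This is only an illustration under assumptions: `EmailMeta`, `deterministic_filter`, and the denylist entries are hypothetical names and values, and the real filters likely consider more than sender and subject.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmailMeta:
    sender: str
    subject: str


DENY_DOMAINS = {"sendgrid.net"}              # illustrative infrastructure domain
DENY_KEYWORDS = {"pay stub", "unsubscribe"}  # illustrative keywords (assumed)


def deterministic_filter(meta: EmailMeta) -> Optional[str]:
    """Return a drop reason, or None if the email passes to the next stage."""
    domain = meta.sender.rsplit("@", 1)[-1].lower()
    # Match the denied domain itself or any subdomain of it.
    for denied in sorted(DENY_DOMAINS):
        if domain == denied or domain.endswith("." + denied):
            return f"denied domain: {denied}"
    subject = meta.subject.lower()
    for keyword in sorted(DENY_KEYWORDS):
        if keyword in subject:
            return f"denied keyword: {keyword}"
    return None


campaign = EmailMeta("news@em123.sendgrid.net", "Spring launch")
complaint = EmailMeta("champion@customer.example", "We need to talk about the renewal")
```

Here `deterministic_filter(campaign)` returns a drop reason, while `deterministic_filter(complaint)` returns `None`: nothing in the sender or subject distinguishes the complaint from an ordinary deal email, which is why later stages are needed.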
Metadata sweep is the first LLM-based stage, and it never reads the body. Using the subject line, participants, headers, timestamp, and labels, this stage reasons over the email in the context of your organization to determine its sensitivity and relevance. The legal thread gets caught here: the subject line mentions a contract dispute and the sender is a law firm. The M&A thread is harder because its subject is vague, and the customer complaint is the most difficult case because it discusses a key person involved in an active deal.
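One way to enforce "never reads the body" is structurally: hand the classifier a type that has no body field at all. The sketch below assumes this design. `MetadataView`, `metadata_sweep`, and `stub_classifier` are hypothetical, and the stub stands in for the LLM call, whose prompt and model are not described here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class MetadataView:
    """Everything the metadata sweep may see. There is deliberately no body field."""
    subject: str
    participants: Tuple[str, ...]
    labels: Tuple[str, ...]
    timestamp: str


# A classifier takes metadata and returns "drop", "keep", or "unsure".
MetadataClassifier = Callable[[MetadataView], str]


def stub_classifier(view: MetadataView) -> str:
    """Stand-in for the LLM call: flags law-firm participants on dispute threads."""
    from_law_firm = any(p.endswith("@lawfirm.example") for p in view.participants)
    if from_law_firm and "dispute" in view.subject.lower():
        return "drop"
    return "unsure"


def metadata_sweep(raw: dict, classify: MetadataClassifier) -> str:
    # Build the view from header fields only; any raw["body"] is never touched.
    view = MetadataView(
        subject=raw["subject"],
        participants=tuple(raw["participants"]),
        labels=tuple(raw.get("labels", ())),
        timestamp=raw["timestamp"],
    )
    return classify(view)


legal = {
    "subject": "Contract dispute: customer MSA",
    "participants": ["partner@lawfirm.example", "gc@yourco.example"],
    "timestamp": "2025-01-01T00:00:00Z",
}
vague = {
    "subject": "Quick sync",
    "participants": ["ceo@yourco.example", "ceo@customer.example"],
    "timestamp": "2025-01-02T00:00:00Z",
}
```

With these inputs the legal thread is dropped on metadata alone, while the vaguely titled thread comes back as "unsure" and has to be settled by a later stage.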
Full email sweep is the first stage where the body and attachments are actually retrieved, and it runs only on what survived everything above. The M&A thread gets resolved here: the sensitive acquisition language gets it classified as confidential and dropped. The customer complaint is where this stage really excels. The body discusses an employee's behavior and raises performance concerns between an executive at a customer and an executive on your team. It gets flagged and, despite every prior stage saying it belonged, it does not make it to the graph.
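The funnel shape described above, where cheap stages run first and the body is fetched only for survivors, can be sketched as follows. This is a simplified illustration, not the production pipeline: `run_funnel`, the stage names, and the stub stages are all assumptions, and the real metadata and body stages are LLM-based rather than string checks.

```python
from typing import Callable, List, Optional, Tuple

# Each metadata-level stage returns a drop reason, or None to pass the email along.
MetaStage = Callable[[dict], Optional[str]]


def run_funnel(
    meta: dict,
    meta_stages: List[Tuple[str, MetaStage]],
    fetch_body: Callable[[], str],
    body_stage: Callable[[dict, str], Optional[str]],
) -> Tuple[str, str]:
    """Return (stage_name, reason) for the first drop, or ("ingest", "passed all stages")."""
    for name, stage in meta_stages:
        reason = stage(meta)
        if reason is not None:
            return name, reason  # short-circuit: later stages never run
    body = fetch_body()  # the body is retrieved only for survivors
    reason = body_stage(meta, body)
    if reason is not None:
        return "full_email_sweep", reason
    return "ingest", "passed all stages"


fetches = []  # records how many times the body was actually retrieved


def fetch() -> str:
    fetches.append(1)
    return "I have concerns about how Jeff handled the renewal."


def deny_infra(meta: dict) -> Optional[str]:
    return "infrastructure domain" if meta["sender"].endswith("sendgrid.net") else None


def metadata_stub(meta: dict) -> Optional[str]:
    return None  # stand-in for the LLM metadata sweep


def body_stub(meta: dict, body: str) -> Optional[str]:
    # Stand-in for the LLM full-email sweep.
    return "personnel complaint" if "concerns about how" in body else None


stages = [("deterministic", deny_infra), ("metadata_sweep", metadata_stub)]
blocked = run_funnel({"sender": "news@em.sendgrid.net"}, stages, fetch, body_stub)
complaint = run_funnel({"sender": "champion@customer.example"}, stages, fetch, body_stub)
```

Passing `fetch_body` as a callable rather than the body itself is the design choice that matters here: the campaign email is dropped without its body ever being retrieved, so only the complaint triggers a fetch.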
Decision log
Every stage writes to an audit log: what ran, what it decided, and why. So when a compliance officer (or you) asks "why wasn't this email ingested?", the answer is a queryable record with the stage-level decision per email. The pipeline produces an auditable chain for every email it touches.
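As a rough sketch of what "queryable" could mean, the example below stores one row per stage decision in an in-memory SQLite table and answers the compliance question with a single query. The table layout, column names, and helper functions are assumptions for illustration; the actual log store and schema are not described in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE decision_log (
        seq      INTEGER PRIMARY KEY AUTOINCREMENT,
        email_id TEXT NOT NULL,
        stage    TEXT NOT NULL,
        decision TEXT NOT NULL,   -- 'pass' or 'drop'
        reason   TEXT NOT NULL
    )
    """
)


def log_decision(email_id: str, stage: str, decision: str, reason: str) -> None:
    """Append one stage-level decision for an email."""
    conn.execute(
        "INSERT INTO decision_log (email_id, stage, decision, reason) VALUES (?, ?, ?, ?)",
        (email_id, stage, decision, reason),
    )


def why_not_ingested(email_id: str):
    """Return the (stage, reason) that dropped the email, or None if nothing dropped it."""
    return conn.execute(
        "SELECT stage, reason FROM decision_log "
        "WHERE email_id = ? AND decision = 'drop' ORDER BY seq LIMIT 1",
        (email_id,),
    ).fetchone()


# The M&A thread from earlier: passes two stages, dropped by the third.
log_decision("msg-1", "deterministic", "pass", "no denylist match")
log_decision("msg-1", "metadata_sweep", "pass", "vague subject, no sensitive signals")
log_decision("msg-1", "full_email_sweep", "drop", "confidential: acquisition terms")
```

Calling `why_not_ingested("msg-1")` returns the dropping stage and its reason, and the full chain for an email is just every row with that `email_id` in `seq` order.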

Figure 2 The end-to-end pipeline for email ingestion, consisting of initial bypass checks, a deterministic filter stage, and LLM-based filtering stages.
What this unlocks
This system increases trust-driven context sharing and ensures sensitive data never enters the graph. That same strict gatekeeping improves everything downstream, providing cleaner meeting briefs, sharper relationship intelligence, and more reliable agent behavior.
The ingestion pipeline is one piece of how we think about data governance at Rox. It fits into the broader permission and governance layer we're building across the platform, which Harish covers in his deep dive.
If these are problems you want to work on, we're hiring.

