Content Ingestion for Public Data

Building a Scalable Data Pipeline for Real-Time Company Insights: Challenges, Design, and Solutions

In today’s fast-paced business environment, timely and accurate information is critical for sales teams to stay competitive. This blog explores how we at Rox designed the Data Extraction Pipeline, a scalable, reliable system that collects, processes, and surfaces relevant public company data to empower sellers.

Problem Statement: Delivering Real-Time Company Data

Sellers use public data to identify high-value leads, craft personalized outreach, and align their solutions with a company's needs. By continuously monitoring this data, sellers refine their strategies and engage in more relevant, context-driven conversations, improving their chances of closing deals. However, the sheer volume of available data can be overwhelming and difficult to navigate. The Rox Platform streamlines this process, transforming the flood of information into easily digestible insights through key features, such as:

  1. Insights: Surface relevant company information to assist in sales strategies and decision-making.

  2. Account Overview: Generate pain hypotheses and corporate objectives for the account overview page, which also serve as context for agent processing.

  3. Teams & Technologies: Extract and process job postings to gather data on company team structures and technology stacks.

  4. Pipeline Generation Actions (Pipe Gen): Leverage insights for automated prospecting actions.

Now, to dive into the details. The Data Extraction Pipeline needs to refresh data for over 100k companies daily, with thousands of companies added each day, ensuring that every public artifact is scraped and processed. Currently, the pipeline supports the following artifact types:

Data Categories and Platform Features

  1. Company Information - Company profiles provide key insights into a company’s market position and growth potential, helping sellers prioritize prospects and align outreach with strategic goals. Historical business data offers additional context, especially in industries like media, where past projects signal future direction.

  2. News Articles - Real-time news updates on leadership changes, product launches, or corporate challenges enable sellers to engage with timely, relevant messaging that aligns with a company’s current focus.

  3. Job Postings - Job postings reveal expansion areas, skill priorities, and team structure, helping sellers target departments in need of new solutions based on hiring trends.

  4. Financial Documents - Financial documents provide insights into a company’s financial health and readiness to invest, helping sellers tailor outreach based on the company’s financial position and strategic priorities.

  5. Blog and Newsroom Links - Company blogs and newsroom updates offer direct insight into a company’s latest initiatives and goals, allowing sellers to create personalized outreach that reflects a deep understanding of the company’s direction.

By consolidating and presenting this data in a structured, actionable way, the Rox Platform empowers sellers to work more efficiently and strategically, ultimately increasing their chances of closing deals.

System Design: Components and Workflow

The Data Extraction Pipeline is made up of several key components:

1. PSQL Database for Tracking Requests

At the heart of our system is the PostgreSQL (PSQL) database, which tracks and monitors requests for data extraction. This database acts as a central hub to maintain the state of each request throughout its lifecycle, ensuring that no request is missed or duplicated.

2. Rox-Core

Rox-Core is a monolithic application hosted on ECS that serves as the backbone of the Rox platform, powering its core features and functionality. One of its key responsibilities is managing and scheduling data extraction requests (e.g., get all news articles related to Rox over the past day). Rox-Core handles each data extraction request and directs it to one of many SQS queues, which are partitioned by artifact type (e.g., news, job postings) and extraction mode (on-demand or batch). This partitioning keeps batch processing and real-time extraction separate, so a large batch run cannot delay on-demand requests.
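To make the partitioning concrete, here is a minimal sketch of how a request could be mapped to a queue. The queue naming scheme, the enum values, and the message fields are illustrative assumptions, not the production configuration.

```python
import json
from enum import Enum


class ArtifactType(str, Enum):
    # Hypothetical identifiers for the five artifact types.
    COMPANY_INFO = "company-info"
    NEWS = "news"
    JOB_POSTINGS = "job-postings"
    FINANCIAL_DOCS = "financial-docs"
    BLOG_NEWSROOM = "blog-newsroom"


class ExtractionMode(str, Enum):
    ON_DEMAND = "on-demand"
    BATCH = "batch"


def queue_name_for(artifact: ArtifactType, mode: ExtractionMode) -> str:
    """Return the (hypothetical) queue name for one artifact type and mode."""
    return f"extract-{artifact.value}-{mode.value}"


def build_request(request_id: str, company_domain: str,
                  artifact: ArtifactType, mode: ExtractionMode,
                  lookback_days: int = 1) -> dict:
    """Pair the target queue with a JSON message body for the request."""
    body = {
        "request_id": request_id,
        "company_domain": company_domain,
        "artifact_type": artifact.value,
        "mode": mode.value,
        "lookback_days": lookback_days,
    }
    return {"queue": queue_name_for(artifact, mode), "body": json.dumps(body)}
```

Because every (artifact type, mode) pair gets its own queue, a backlog in one batch queue has no effect on the on-demand queue for the same artifact type.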

3. AWS Lambdas: Coordinators and Scrapers

On the receiving end of each SQS queue are AWS Lambda functions that extract and process each request. There are two types of Lambda functions:

  • Coordinator Lambda: This function identifies which data needs to be extracted. For instance, it retrieves all news links related to a company (e.g., rox.com) published in the past day. The coordinator checks various curated data sources for new content and gathers a list of URLs for the Scraper Lambda to process. Before the scraper runs, the request is sent to Rox-Core via an SQS queue for rate limiting and state management.

  • Scraper Lambda: This function is responsible for extracting and validating the content. It enforces checks for relevancy (using LLMs) and timeliness (validating the dates of articles and posts). The scrapers process data and store raw content in S3, while metadata is stored in DynamoDB for faster lookups.
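The split between the two Lambda types can be sketched as two small functions: one that turns candidate links into scrape tasks, and one that applies the timeliness check. Function names, the task shape, and the treatment of naive timestamps as UTC are assumptions made for illustration; the LLM relevancy check is left out here.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


def plan_scrape_tasks(request_id: str, candidate_urls: Iterable[str]) -> List[dict]:
    """Coordinator side: deduplicate URLs and emit one task per URL."""
    seen = set()
    tasks = []
    for url in candidate_urls:
        cleaned = url.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            tasks.append({"request_id": request_id, "url": cleaned})
    return tasks


def is_timely(published_at: Optional[datetime], now: datetime,
              lookback: timedelta) -> bool:
    """Scraper side: accept only content dated inside the lookback window.

    `now` is expected to be timezone-aware.
    """
    if published_at is None:
        return False  # no usable date was extracted
    if published_at.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        published_at = published_at.replace(tzinfo=timezone.utc)
    return now - lookback <= published_at <= now
```

Rejecting content with no extractable date is a conservative choice in this sketch; a real pipeline might instead route such items to a secondary date check.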

4. S3 for Unstructured Data Storage

Given the large and unstructured nature of content like news articles and job postings, Amazon S3 was chosen for storage. S3 is highly scalable and allows us to store content without worrying about the underlying structure. This content is processed and metadata is stored in DynamoDB, which is optimized for high-throughput and low-latency access.
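As a rough illustration of how raw content and metadata might relate, the sketch below derives an S3 object key from the source URL and builds a matching metadata record. The key layout, the attribute names (pk, sk, s3_key), and the choice of a SHA-256 digest are all hypothetical; the actual schema is not described in this post.

```python
import hashlib
from datetime import datetime, timezone


def s3_key_for(artifact_type: str, company_domain: str, url: str) -> str:
    """Derive a deterministic object key from the source URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"raw/{artifact_type}/{company_domain}/{digest}"


def metadata_item(artifact_type: str, company_domain: str, url: str,
                  published_at: datetime, bucket: str) -> dict:
    """Build a metadata record that points back at the raw object.

    `published_at` is expected to be timezone-aware.
    """
    return {
        # Hypothetical partition key: one partition per company and artifact type.
        "pk": f"{company_domain}#{artifact_type}",
        # Hypothetical sort key: ISO timestamps sort chronologically as strings.
        "sk": published_at.astimezone(timezone.utc).isoformat(),
        "url": url,
        "s3_bucket": bucket,
        "s3_key": s3_key_for(artifact_type, company_domain, url),
    }
```

With a layout like this, a lookup such as "news for rox.com in the past day" reads only the small metadata records and fetches raw content from S3 when it is actually needed.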

5. Event-Driven Architecture

Every time data is written to S3, an event is triggered and sent to another SQS queue. Another AWS Lambda function picks up the event, processes it, and distributes the information to the appropriate feature listeners (e.g., Insights, Teams & Technologies) within Rox-Core, where it becomes available to sellers.
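The fan-out step can be sketched as follows. When S3 notifications are delivered through SQS, each SQS record's body is a JSON string containing the S3 event, and object keys arrive URL-encoded. The routing table, the listener names, and the "raw/<artifact>/..." key layout are assumptions for illustration only.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical routing table from artifact type to feature listeners.
LISTENERS_BY_ARTIFACT = {
    "news": ["insights"],
    "job-postings": ["teams_and_technologies"],
}


def s3_objects_from_sqs_event(event: dict) -> list:
    """Extract (bucket, key) pairs from an SQS-delivered S3 notification."""
    objects = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        # Non-object messages (such as S3 test events) carry no "Records".
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def listeners_for(key: str) -> list:
    """Pick listeners from the artifact segment of an assumed key layout."""
    parts = key.split("/")
    artifact = parts[1] if len(parts) > 2 and parts[0] == "raw" else ""
    return LISTENERS_BY_ARTIFACT.get(artifact, [])
```

Keeping the routing in a plain lookup table means adding a new feature listener is a one-line change that does not touch the scrapers.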

Key Challenges and Solutions

Building a scalable and reliable data extraction pipeline came with its share of challenges. Below are some of the key hurdles we faced and how we overcame them:

1. Rate Limiting

One of the most significant bottlenecks in our pipeline was rate limiting, as the system made thousands of API calls every minute. The difficulty arose from coordinating rate limits between Rox-Core and AWS Lambdas, both of which used the same API keys.

We initially used Redis as a centralized rate limiter. However, scaling Redis for thousands of concurrent Lambda executions proved problematic due to connection pooling issues. As a result, we shifted rate limiting responsibility to Rox-Core, where we could better control the number of Redis connections by using a connection pool.
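The post does not say which rate limiting algorithm is used, so the sketch below shows one common choice, a token bucket, purely as an illustration. It keeps its state in process memory; in a setup like the one described, the shared counters would live in Redis and be reached through Rox-Core's connection pool. The class name and parameters are hypothetical.

```python
import time


class TokenBucket:
    """In-process token bucket: allows bursts up to `capacity`,
    then refills at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take `tokens` if available; return False instead of blocking."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A caller that receives False would typically leave the request on its queue and retry later, which fits a design where requests already flow through SQS.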

2. Data Accuracy with LLMs

A challenge in using Large Language Models (LLMs) for data extraction was ensuring data accuracy. Determining whether an article was relevant to a company or verifying the correct "posted on" date proved difficult, even when using external tools like htmldate and news-please.

The solution involved refining our LLM-based relevancy check by passing the model both the company’s information and the article content. We also wrote prompts that help the LLM reason about the accuracy of dates mentioned in the text, improving timeliness checks.
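A minimal sketch of what such a check could look like is shown below. It only builds the prompt and parses a JSON reply; the model call itself is omitted. The prompt wording, the reply schema, and all function names are hypothetical and are not the prompts used in production.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Verdict:
    relevant: bool
    published_on: Optional[date]


def build_relevancy_prompt(company_name: str, company_domain: str,
                           company_description: str, article_text: str,
                           scraped_on: date) -> str:
    """Give the model both the company context and the article content."""
    return (
        "You are checking whether an article is about a specific company.\n"
        f"Company: {company_name} ({company_domain})\n"
        f"Description: {company_description}\n"
        f"Today's date: {scraped_on.isoformat()}\n"
        "Article:\n"
        f"{article_text}\n\n"
        "Reply with JSON only: "
        '{"relevant": true or false, "published_on": "YYYY-MM-DD" or null}. '
        "Resolve relative dates such as 'yesterday' against today's date."
    )


def parse_verdict(raw: str) -> Verdict:
    """Parse the model's JSON reply; anything but a literal true is not relevant."""
    data = json.loads(raw)
    published = data.get("published_on")
    return Verdict(
        relevant=data.get("relevant") is True,
        published_on=date.fromisoformat(published) if published else None,
    )
```

Supplying the scrape date in the prompt gives the model an anchor for relative phrases like "last week", which is one way to support the date reasoning described above.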

3. Updating State Management

Maintaining state throughout the data extraction pipeline posed challenges, as the Lambdas weren't directly configured to commit transactions to the PSQL database. Instead, they had to update the database via HTTP requests to Rox-Core. This approach led to scalability and reliability issues when processing over 50,000 companies, as the simultaneous influx of requests overloaded Rox-Core's load balancer, causing intermittent 502 / 429 errors. The solution was to eliminate HTTP requests for database updates and instead queue messages in SQS, with a listener in Rox-Core to process the updates efficiently.
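The queue-based update path can be sketched as a message encoder on the Lambda side and an idempotent handler on the Rox-Core side. SQS standard queues deliver messages at least once, so the handler below ignores duplicates and out-of-order transitions. The state names, the message shape, and the in-memory dict standing in for the PSQL table are all assumptions for illustration.

```python
import json

# Hypothetical request lifecycle: each state lists the states it may move to.
ALLOWED_TRANSITIONS = {
    "queued": {"in_progress", "failed"},
    "in_progress": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def encode_update(request_id: str, new_state: str) -> str:
    """Lambda side: serialize a state update as an SQS message body."""
    return json.dumps({"request_id": request_id, "state": new_state})


def apply_update(store: dict, message_body: str) -> bool:
    """Rox-Core side: apply one update; return False if it was skipped.

    `store` maps request_id to state and stands in for the database table.
    """
    update = json.loads(message_body)
    request_id = update["request_id"]
    new_state = update["state"]
    current = store.get(request_id, "queued")
    if new_state not in ALLOWED_TRANSITIONS.get(current, set()):
        return False  # duplicate delivery or invalid transition
    store[request_id] = new_state
    return True
```

Because the listener pulls messages at its own pace, a burst of updates lengthens the queue instead of overloading the load balancer.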

Opportunities for Future Improvements

While the current system is robust and scalable, there are still opportunities for improvement:

  1. Enhanced Data Validation: There is ongoing work to improve the relevance checks performed by LLMs, especially in terms of parsing dates more accurately and enhancing our models’ ability to understand company-specific contexts.

  2. Pushing to a Fully SQL-Driven Data Layer: As the volume of data grows, we plan to migrate metadata storage from DynamoDB to PostgreSQL to unify our data access layer. This will simplify querying across all types of company data and improve our analytics capabilities.

Conclusion

Building a real-time data extraction pipeline at scale is no easy task. By building a distributed architecture on AWS Lambda, SQS, and S3, we’ve been able to create a system that scales to meet the demands of the Rox platform. Although challenges remain, especially around rate limiting and data accuracy, the system has proven to be a valuable asset for delivering timely and relevant company insights to sales teams. With ongoing improvements, we’re excited to see how the platform can continue to evolve to meet the needs of its users.
