Event-Driven Framework

Introduction

Sellers today struggle with tedious research, missed updates from fragmented sources, and outreach that is hard to personalize because the underlying data is unreliable. At Rox, we integrate public and private data into a single view in real time, enabling sellers to engage more effectively and drive better results.

To achieve this, our systems must manage very high volumes of real-time data with a focus on concurrency and fault tolerance. We process thousands of updates per second from various data sources, including database queries and calls to large language models (LLMs). This demands an efficient, scalable, and robust framework.

Enter the Rox Task Framework.

This post dives into the engineering decisions and components that make our Task Framework reliable, modular, and capable of handling diverse workloads.

Why the Task Framework?

Our work involves processing many types of tasks: 64 and counting.

These tasks include:

Querying databases and data warehouses.
Running Python code.
Scraping public data sources.
Calling external APIs, including LLMs.

Each task is a unit of work that must be processed efficiently without disrupting the broader pipeline.

Key requirements of the framework:

High concurrency and task distribution.
Graceful error handling and retries.
Centralized monitoring and observability.

Architecture Overview

The Rox Task Framework breaks down into six main components:

Task Publisher - Publishes tasks to the queue with the necessary metadata, such as priority and payload.
Task Queue - Uses Amazon SQS for queuing. This ensures decoupled processing and supports high-throughput workloads.
Task Listener - Deployed on ECS, the listener retrieves messages from the queue and initiates processing.
Task Handler - Determines the correct logic for a given task type.
Task Executor - Executes the logic defined in the handler, managing retries and handling errors gracefully.
Task Run and Task Run Log - Stores metadata and state changes for each task run, providing complete traceability.

Key Features

Task Queuing and Distribution

The Task Queue, powered by SQS, allows for massive parallelism. Tasks are distributed across ECS for long-running operations or Lambda for short-lived workloads.

Error Handling and Retries

Failures are inevitable. Our framework employs:

Exponential backoff for retries.
Dead-letter queues (DLQs) for unrecoverable tasks.
Task-level isolation to prevent cascading failures.
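
As a rough illustration of the first two points, the sketch below computes an exponential backoff delay with jitter and wraps a task in a retry loop. It is not the framework's actual code: the function names, the base delay, the cap, and the attempt limit are all assumptions chosen for the example.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Delay in seconds before retry number `attempt` (1-based), with full jitter."""
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng() * ceiling


def run_with_retries(fn, max_attempts=4, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Returns ("ok", result) on success, or ("dead_letter", last_error) so the
    caller can hand the task to a dead-letter queue.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return "ok", fn()
        except Exception as exc:  # keep the failure contained to this one task
            last_error = exc
            if attempt < max_attempts:
                sleep(backoff_delay(attempt))
    return "dead_letter", last_error
```

The jitter spreads retries out so that many tasks failing at the same moment do not all retry at the same moment.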

Centralized Monitoring

We use Datadog and Sentry for monitoring. We track:

Task success and failure rates.
Queue depth and latency.
Executor performance metrics.
Trace spans.

This observability enables rapid debugging and optimization.

A Closer Look at Task Execution

Here’s a step-by-step breakdown of how a task flows through the framework:

Task Creation
1. The journey begins when a service publishes a task to the queue. This task represents a discrete unit of work, such as analyzing a news article, processing a dataset, or generating a customer-specific insight.
2. The task payload includes metadata like task type, priority, required resources, and any dependencies.
3. Example: A task to analyze a breaking news article might include the article URL, target audience, and analysis parameters.
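
To make the payload concrete, here is a minimal sketch of what a task message could look like when serialized for the queue. The class name, field names, priority values, and the example URL are hypothetical; the real schema is not shown in this post.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TaskMessage:
    task_type: str                      # which handler should process the task
    payload: dict                       # task-specific inputs
    priority: str = "normal"            # hypothetical values: "high", "normal", "low"
    depends_on: list = field(default_factory=list)  # ids of prerequisite tasks
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_body(self) -> str:
        """Serialize to the JSON string that would be sent as the message body."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_body(cls, body: str) -> "TaskMessage":
        """Rebuild the message on the consumer side."""
        return cls(**json.loads(body))


msg = TaskMessage(
    task_type="analyze_news_article",
    payload={"article_url": "https://example.com/article", "audience": "enterprise-sales"},
    priority="high",
)
body = msg.to_body()
```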
Queuing
1. Once published, the task enters the Amazon SQS queue, which acts as the backbone for task orchestration.
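2. The queue can be configured as FIFO (First-In-First-Out) for tasks that must run in order. SQS has no built-in per-message priority, so time-sensitive operations need a separate mechanism, such as a dedicated queue per priority level.
3. SQS stores tasks durably and redelivers any message that is not deleted before its visibility timeout expires, so a task is not lost if a worker fails mid-run.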
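
SQS itself has no per-message priority, so one common way to get priority behaviour on top of it is to keep one queue per priority level and poll the most urgent queue first. The sketch below shows that idea with in-memory stand-ins; the queue names and the polling order are assumptions, and this may not be how Rox implements it.

```python
from collections import deque

# In-memory stand-ins for one queue per priority level (names are hypothetical).
QUEUES = {"high": deque(), "normal": deque(), "low": deque()}
POLL_ORDER = ("high", "normal", "low")


def publish(body, priority="normal"):
    """Append a message to the queue for its priority level."""
    QUEUES[priority].append(body)


def receive_next():
    """Return (priority, body) from the most urgent non-empty queue, else None."""
    for priority in POLL_ORDER:
        if QUEUES[priority]:
            return priority, QUEUES[priority].popleft()
    return None
```

Within each level, messages still come out in the order they were published.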
Listening
1. Amazon ECS (Elastic Container Service) workers continuously poll the queue for new tasks.
2. Each worker checks its availability and resource capacity before picking up a task. Workers are configured with autoscaling policies to adjust capacity based on the queue size, ensuring responsiveness during traffic spikes.
3. Example: If a sudden influx of tasks occurs (e.g., breaking news), the framework spins up additional workers to handle the load dynamically.
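
The polling behaviour can be sketched roughly as follows. `InMemoryQueue` is a toy stand-in for the real queue, and `listen`, `free_slots`, and `max_polls` are names invented for this example. The point it illustrates is that a worker takes only as many tasks as it has capacity for, and acknowledges (deletes) a message only after processing succeeds.

```python
class InMemoryQueue:
    """Tiny stand-in for a message queue with receive/delete semantics."""

    def __init__(self, bodies):
        self._pending = list(bodies)
        self.in_flight = {}       # handle -> body, received but not yet deleted
        self._next_handle = 0

    def receive(self, max_messages):
        batch = []
        while self._pending and len(batch) < max_messages:
            self._next_handle += 1
            body = self._pending.pop(0)
            self.in_flight[self._next_handle] = body
            batch.append((self._next_handle, body))
        return batch

    def delete(self, handle):
        self.in_flight.pop(handle)


def listen(queue, process, free_slots, max_polls):
    """Poll up to max_polls times, taking at most free_slots tasks per poll."""
    done = []
    for _ in range(max_polls):
        for handle, body in queue.receive(max_messages=free_slots):
            try:
                process(body)
            except Exception:
                continue  # not deleted: the message stays in flight for redelivery
            queue.delete(handle)  # acknowledge only after success
            done.append(body)
    return done
```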
Handling
1. The task handler interprets the task type and executes the appropriate logic. Handlers are modular components designed to support diverse workloads.
2. For example: A task might involve querying a data warehouse (e.g., Snowflake or BigQuery) to fetch customer data. Another task might call an external API or leverage an LLM (Large Language Model) for text summarization.
3. This modularity allows seamless integration of new task types without disrupting existing workflows.
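
One simple way to get this kind of modularity is a registry that maps each task type to a handler function, so adding a task type means registering one new function. The sketch below assumes that design; the decorator, the task type names, and the placeholder handler bodies are illustrative only.

```python
HANDLERS = {}


def handler(task_type):
    """Decorator that registers a function as the handler for one task type."""
    def register(fn):
        if task_type in HANDLERS:
            raise ValueError(f"duplicate handler for {task_type!r}")
        HANDLERS[task_type] = fn
        return fn
    return register


@handler("warehouse_query")
def run_warehouse_query(payload):
    # Placeholder: a real handler would query the data warehouse here.
    return {"rows_requested_for": payload["customer_id"]}


@handler("summarize_text")
def summarize_text(payload):
    # Placeholder: a real handler would call an LLM here.
    return {"summary": payload["text"][:20]}


def dispatch(task_type, payload):
    """Look up the handler for a task type and run it."""
    try:
        fn = HANDLERS[task_type]
    except KeyError:
        raise LookupError(f"no handler registered for {task_type!r}") from None
    return fn(payload)
```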
Execution
1. The executor performs the actual work defined by the handler.
2. It logs progress and results, ensuring transparency and traceability.
3. It handles retries, automatically re-attempting failed tasks based on retry policies such as exponential backoff.
4. It monitors performance, triggering alerts if execution times exceed thresholds or if error rates spike.
5. Example: An executor processing an LLM task might split the workload into smaller chunks for efficiency and parallel execution.
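
The chunking example could look something like the sketch below, which splits a workload into fixed-size chunks and processes them concurrently with a thread pool. The chunk size, worker count, and `fake_llm_summarize` (a stand-in that counts words rather than calling a model) are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_in_chunks(items, process_chunk, chunk_size=2, max_workers=4):
    """Process chunks concurrently; results come back in the original chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_chunk, chunked(items, chunk_size)))


def fake_llm_summarize(paragraphs):
    # Stand-in for an LLM call: returns a word count instead of a summary.
    return sum(len(p.split()) for p in paragraphs)
```

Threads suit this case because calls to an external API spend most of their time waiting on the network.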
Completion
1. Once the task is successfully executed, its state is updated in the Task Run Log, a centralized system that tracks the lifecycle of every task.
2. The log captures details like start and end times, status (success, retry, or failure), and execution metadata for debugging and reporting.
3. The framework then marks the task as complete, allowing downstream processes or services to consume the results.
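
A minimal sketch of a task run record with an append-only log of state changes is shown below. The status names follow the ones mentioned above plus hypothetical "queued" and "running" states; the transition table and field names are assumptions, not the actual Task Run Log schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Allowed status transitions; the exact set of states is illustrative.
TRANSITIONS = {
    "queued": {"running"},
    "running": {"success", "retry", "failure"},
    "retry": {"running"},
    "success": set(),
    "failure": set(),
}


@dataclass
class TaskRun:
    task_id: str
    status: str = "queued"
    log: list = field(default_factory=list)  # append-only history of state changes

    def transition(self, new_status, **metadata):
        """Move to new_status and record the change, rejecting illegal moves."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.log.append({
            "from": self.status,
            "to": new_status,
            "at": datetime.now(timezone.utc).isoformat(),
            **metadata,
        })
        self.status = new_status
```

Keeping the history separate from the current status lets a failed run be debugged from its full sequence of attempts.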

Challenges and Solutions

High Concurrency

Challenge:

With potentially hundreds of thousands of tasks flowing through the system simultaneously, scaling up to meet demand without bottlenecks was critical.

Solution:

Horizontal Scaling: The system dynamically increases the number of ECS workers and Lambda invocations during peak loads.
Auto-Scaling Policies: Configured to monitor queue size, CPU utilization, and memory consumption, ensuring resources are provisioned efficiently.
Load Distribution: Tasks are spread across multiple workers to avoid overloading a single node.
Example: During a high-volume event like a product launch, the system scales from tens to thousands of workers within minutes.
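
A queue-driven scaling rule can be as simple as sizing the fleet so that each worker owns a bounded share of the backlog. The function below is a sketch of that calculation; the parameter names and the default limits are assumptions, and real auto-scaling policies would also weigh CPU and memory as described above.

```python
import math


def desired_workers(queue_depth, tasks_per_worker, min_workers=1, max_workers=1000):
    """Workers needed so each handles at most `tasks_per_worker` queued tasks."""
    needed = math.ceil(queue_depth / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))
```

For instance, with a target of 50 queued tasks per worker, a backlog of 501 tasks yields 11 workers.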

Error Isolation

Challenge:

A single failing task, such as a malformed payload or a timeout, could disrupt the entire queue or slow down critical operations.

Solution:

Strict Task Isolation: Each task runs in its own container or Lambda environment, ensuring errors are contained and do not cascade.
Dead Letter Queues (DLQs): Failed tasks are routed to a DLQ for later inspection and reprocessing, preventing retries from clogging the main queue.
Enhanced Monitoring: Real-time dashboards and alerts flag problematic tasks, enabling rapid intervention.
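
The dead-letter behaviour can be sketched as follows: a message that keeps failing is retried a limited number of times, then parked instead of being requeued. In SQS this is configured through a redrive policy with a `maxReceiveCount`; the in-memory version below only imitates the idea, and the limit of 3 and the function name are assumptions.

```python
MAX_RECEIVE_COUNT = 3  # hypothetical limit


def drain(main_queue, process, max_receive_count=MAX_RECEIVE_COUNT):
    """Process every message; move repeat offenders to a dead-letter list.

    main_queue is a list of message bodies. Returns (succeeded, dead_lettered).
    """
    pending = [(body, 0) for body in main_queue]
    succeeded, dead_lettered = [], []
    while pending:
        body, receives = pending.pop(0)
        receives += 1
        try:
            process(body)
        except Exception:
            if receives >= max_receive_count:
                dead_lettered.append(body)        # stop retrying; park for inspection
            else:
                pending.append((body, receives))  # requeue behind other work
            continue
        succeeded.append(body)
    return succeeded, dead_lettered
```

A malformed payload fails on every attempt, so it ends up dead-lettered while the healthy tasks around it complete normally.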

Diverse Task Types

Challenge:

The system needed to handle a wide variety of tasks with different complexities, runtimes, and resource requirements.

Solution:

Modular Architecture: Handlers and executors are designed as plug-and-play components, making it easy to add or update task logic.
Task Profiling: Tasks are categorized based on complexity and resource needs (e.g., CPU-intensive, I/O-heavy) to allocate appropriate resources.
Resource Pools: Separate pools of workers handle high-priority or resource-intensive tasks, ensuring no single task type monopolizes the system.
Example: An NLP-based task requiring GPU processing is routed to a specialized worker, while simpler tasks like data fetching are handled by standard instances.
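
Routing by profile can be sketched as a lookup from task type to resource profile to worker pool. All task types, profile labels, and pool names below are hypothetical and only illustrate the shape of the mapping.

```python
# Hypothetical task profiles and pool names, for illustration only.
TASK_PROFILES = {
    "nlp_embedding": "gpu",
    "warehouse_query": "io",
    "data_fetch": "io",
    "report_render": "cpu",
}
POOLS = {"gpu": "gpu-workers", "cpu": "cpu-workers", "io": "standard-workers"}
DEFAULT_POOL = "standard-workers"


def pool_for(task_type):
    """Pick the worker pool for a task type; unknown types use the default pool."""
    profile = TASK_PROFILES.get(task_type)
    return POOLS.get(profile, DEFAULT_POOL)
```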

Why This Matters

The Rox Task Framework is designed to address critical engineering challenges:

High Throughput: Efficiently process massive volumes of real-time data without sacrificing performance.
Fast Delivery: Generate actionable insights quickly, ensuring users can make timely decisions.
Workload Versatility: Reliably handle diverse tasks, from lightweight operations to resource-intensive computations.

This framework is a practical solution to complex problems, enabling scalability, speed, and reliability while meeting the demands of modern data-driven systems.

Join Us

At Rox, we solve hard problems at scale. If designing and building systems like this excites you, we’d love to hear from you. Check out our careers page for more information.

Last updated 8 months ago