# Entity Graph

#### Introduction

The foundation of any effective Unified Knowledge Graph (UKG) lies in its ability to resolve entities across multiple data sources. Entity resolution ensures that data from CRM systems, ticketing platforms, product logs, and public data sources like ZoomInfo are accurately linked, de-duplicated, and contextualized.

This blog explores the step-by-step process of resolving entities and constructing the knowledge graph, highlighting algorithms, data flow, and real-world challenges.

***

#### The Challenge: Fragmented Data, Shared Identities

Entity resolution tackles the problem of fragmented identities across data sources. For example:

* A Salesforce account for "Oracle" might have an ID of `a` and a domain of `oracle.com`.
* The same organization in Zendesk might have an external ID of `1`, also linked to `oracle.com`.
* Product logs might refer to the same entity by a completely different mechanism.

Without entity resolution, these would appear as separate entities, leading to inconsistency and duplication.

***

#### Entity Resolution in Four Steps

**Step 1: Build the Lookup Table Using IDs**

The simplest form of resolution applies when the systems already share IDs with each other. For example, Zendesk has an external ID field that maps to the Salesforce ID.

**Example Lookup Table**:

| **Source**    | **ID** | **Domain** | **External ID** | **Account ID** |
| ------------- | ------ | ---------- | --------------- | -------------- |
| Salesforce    | a      | oracle.com |                 |                |
| Zendesk       | 1      | oracle.com | a               |                |
| Product Usage |        |            |                 | one            |
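
As a minimal sketch of this ID-based linking, the snippet below pairs each record with the Salesforce record its external ID points to. The record layout, the field names, and the `build_lookup` helper are assumptions made for illustration, not the actual Rox schema.

```python
# Hypothetical source records mirroring the example lookup table.
records = [
    {"source": "Salesforce", "id": "a", "domain": "oracle.com", "external_id": None},
    {"source": "Zendesk", "id": "1", "domain": "oracle.com", "external_id": "a"},
    {"source": "Product Usage", "id": None, "domain": None, "external_id": None,
     "account_id": "one"},
]


def build_lookup(records, target_source="Salesforce"):
    """Return (record_key, target_key) pairs for records whose external ID
    matches an ID in the target source."""
    target_ids = {r["id"] for r in records if r["source"] == target_source}
    links = []
    for r in records:
        ext = r.get("external_id")
        if ext is not None and ext in target_ids:
            links.append(((r["source"], r["id"]), (target_source, ext)))
    return links


links = build_lookup(records)
```

In this sketch the Product Usage record has no shared ID, so it stays unlinked until some other signal ties it to the account.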

***

**Step 2: Merge Data into the Knowledge Graph**

Using the lookup table:

* **No Existing Graph**: Create a new graph where each row in the lookup table is assigned a unique `rox_id` (the universal entity identifier).
* **Existing Graph**: Merge the lookup table with the existing graph using an outer join. Keep entities that are new in the lookup table, and drop entities that no longer exist in any source.

**Example Graph**:

| **rox\_id** | **Salesforce** | **Zendesk** | **Product Usage** |
| ----------- | -------------- | ----------- | ----------------- |
| uuid1       | a              | 1           | one               |
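
The merge can be sketched roughly as follows, assuming the lookup table has already been grouped into sets of linked `(source, id)` keys. The `merge_into_graph` function and its dictionary-based graph are hypothetical stand-ins for the outer join described above.

```python
import uuid


def merge_into_graph(groups, existing=None):
    """Assign a rox_id to every group of linked (source, id) keys.

    groups:   list of sets of (source, id) keys that refer to one entity
    existing: optional dict mapping (source, id) -> rox_id from a prior build
    """
    existing = existing or {}
    graph = {}
    for group in groups:
        # Reuse a rox_id when any member of the group is already in the graph.
        known = sorted({existing[key] for key in group if key in existing})
        rox_id = known[0] if known else str(uuid.uuid4())
        for key in group:
            graph[key] = rox_id
    # Keys present in `existing` but absent from every group are not copied,
    # which drops entities that no longer exist in any source.
    return graph


groups = [{("Salesforce", "a"), ("Zendesk", "1"), ("Product Usage", "one")}]
graph = merge_into_graph(groups)
```

When a group touches more than one existing `rox_id`, this sketch simply keeps the smallest one; a real merge policy would likely need to be more deliberate.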

***

**Step 3: Resolve Relationships and Data Sources**

Once entities are linked:

* Assign a priority to relationships, ensuring data is merged in the correct order (e.g., Salesforce > Zendesk > Product Usage).
* Materialize `rox_id` mappings for each data source, creating a unified representation.

**Unified Representation**:

| **rox\_id** | **Data Source** | **Source ID** |
| ----------- | --------------- | ------------- |
| uuid1       | Salesforce      | a             |
| uuid1       | Zendesk         | 1             |
| uuid1       | Product Usage   | one           |
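
One way to picture this step is to flatten the graph into one row per data source, ordered by source priority. The `SOURCE_PRIORITY` mapping and the `materialize_mappings` helper below are illustrative assumptions.

```python
# Lower number means higher priority (Salesforce > Zendesk > Product Usage).
SOURCE_PRIORITY = {"Salesforce": 0, "Zendesk": 1, "Product Usage": 2}


def materialize_mappings(graph):
    """Turn {(source, source_id): rox_id} into sorted (rox_id, source, source_id) rows."""
    rows = [(rox_id, source, source_id) for (source, source_id), rox_id in graph.items()]
    # Unknown sources sort after all known ones.
    lowest = len(SOURCE_PRIORITY)
    return sorted(rows, key=lambda row: (row[0], SOURCE_PRIORITY.get(row[1], lowest)))


graph = {
    ("Zendesk", "1"): "uuid1",
    ("Product Usage", "one"): "uuid1",
    ("Salesforce", "a"): "uuid1",
}
mappings = materialize_mappings(graph)
```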

***

**Step 4: Materialize the Entities**

The final step involves resolving attributes like `domain`, `name`, or `email` for each entity:

1. Use ERM relationships to map fields from individual data sources.
2. Resolve conflicts using rules (e.g., prioritize Salesforce over Zendesk for domains).
3. Store the resolved values in the knowledge graph for downstream applications.

**Example Materialized Entity**:

| **Entity Type** | **rox\_id** | **Domain** | **Source System** |
| --------------- | ----------- | ---------- | ----------------- |
| Company         | uuid1       | oracle.com | Salesforce        |
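
A simple version of the conflict rule in item 2 could look like the sketch below: take the value from the highest-priority source that has one. The `resolve_attribute` function and the priority list are assumptions for illustration.

```python
SOURCE_PRIORITY = ["Salesforce", "Zendesk", "Product Usage"]


def resolve_attribute(values_by_source, priority=SOURCE_PRIORITY):
    """Return (value, source) from the highest-priority source with a non-empty value."""
    for source in priority:
        value = values_by_source.get(source)
        if value:
            return value, source
    return None, None


domain, source_system = resolve_attribute(
    {"Zendesk": "oracle.com", "Salesforce": "oracle.com"}
)
```

Keeping the winning source next to the value, as in the Source System column above, makes it easier to audit where a resolved attribute came from.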

***

#### Algorithms and Optimization

1. **Exact Match**:
   * Directly links IDs across systems.
   * Example: Zendesk external ID `a` maps to Salesforce ID `a`.
2. **Fuzzy Match**:
   * Uses fields like domains or emails for approximate matching.
   * Weighted matching (e.g., TF-IDF on `domain`) improves accuracy when values are similar but not identical.
3. **Priority-Based Resolution**:
   * Orders data sources based on trustworthiness or data quality.
   * Applies advanced AI algorithms to assess how relevant entity records are to one another.
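
To make the exact and fuzzy cases concrete, here is a small sketch that normalizes domains, checks for an exact match, and otherwise falls back to a string-similarity score. It uses Python's `difflib.SequenceMatcher` as a simple stand-in rather than TF-IDF, and the helper names and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher


def normalize_domain(value):
    """Lowercase and strip the scheme, a leading 'www.', and any path."""
    value = value.strip().lower()
    for prefix in ("https://", "http://"):
        if value.startswith(prefix):
            value = value[len(prefix):]
    if value.startswith("www."):
        value = value[4:]
    return value.split("/")[0]


def match_score(a, b):
    """Return 1.0 for an exact match after normalization, else a similarity ratio."""
    a, b = normalize_domain(a), normalize_domain(b)
    if a == b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def best_match(candidate, known_domains, threshold=0.85):
    """Return (domain, score) for the closest known domain, or (None, score) if too weak."""
    scored = [(match_score(candidate, d), d) for d in known_domains]
    score, domain = max(scored, default=(0.0, None))
    return (domain, score) if score >= threshold else (None, score)
```

For instance, `best_match("oracle.co", ["oracle.com", "databricks.com"])` would pick `oracle.com`, while a domain such as `spark.apache.org` would fall below the threshold against `oracle.com`.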

***

#### Challenges in Entity Resolution

1. **Low Fidelity Data**:
   * Fields like `domain` or `email` might be incorrect or incomplete.
   * Example: A Salesforce entry for Databricks pointing to `https://spark.apache.org`.
2. **High Cardinality**:
   * Multiple results for a single query (e.g., ZoomInfo returning several potential matches).
   * Solution: Introduce a "User Feedback Required" step for ambiguous cases.
3. **Dynamic Updates**:
   * Ensuring real-time sync with new data sources while maintaining graph consistency.

Given the challenges above, we cannot simply persist every resolution automatically. Sometimes we need a human in the loop to verify and confirm the associations. Once they are confirmed, we maintain the mappings we constructed.
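
A human-in-the-loop gate might be sketched as below: a match is linked automatically only when the top candidate is both strong and clearly ahead of the runner-up. The `triage` function, its thresholds, and the status strings are hypothetical.

```python
def triage(candidates, accept_at=0.9, margin=0.1):
    """Decide whether a match can be linked automatically.

    candidates: list of (candidate_id, score) pairs, e.g. from an enrichment lookup.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    # No candidate, or no strong candidate: ask a person.
    if not ranked or ranked[0][1] < accept_at:
        return {"status": "user_feedback_required", "candidates": ranked}
    # Several close candidates (high cardinality): ask a person.
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return {"status": "user_feedback_required", "candidates": ranked}
    return {"status": "auto_linked", "match": ranked[0][0]}


clear = triage([("z1", 0.95), ("z2", 0.50)])
ambiguous = triage([("z1", 0.95), ("z2", 0.93)])
```

Confirmed decisions would then be stored alongside the mappings so that later builds do not ask the same question again.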

***

#### Streaming Graphs

Although Rox operates on internet-scale data, users expect to see entities and their relationships as quickly as possible. Today the graph build process runs mostly in batch. With the advent of newer streaming technologies, Rox is looking to capture changes and process deltas as soon as they are detected, then materialize the affected entities and relationships. We will explore this in future sections.

#### Conclusion

Entity resolution is the cornerstone of building a Unified Knowledge Graph. By linking, de-duplicating, and contextualizing data across diverse sources, the UKG enables seamless insights and intelligent automation. The process, though complex, ensures that organizations can leverage their data with accuracy and confidence.
