Entity Graph Builder
Introduction
The foundation of any effective Unified Knowledge Graph (UKG) lies in its ability to resolve entities across multiple data sources. Entity resolution ensures that data from CRM systems, ticketing platforms, product logs, and public data sources like ZoomInfo are accurately linked, de-duplicated, and contextualized.
This blog explores the step-by-step process of resolving entities and constructing the knowledge graph, highlighting algorithms, data flow, and real-world challenges.
The Challenge: Fragmented Data, Shared Identities
Entity resolution tackles the problem of fragmented identities across data sources. For example:
A Salesforce account for "Oracle" might have an ID of "a" and a domain of oracle.com.
The same organization in Zendesk might have an ID of "1", also linked to oracle.com.
Product logs might refer to the same entity by a completely different mechanism.
Without entity resolution, these would appear as separate entities, leading to inconsistency and duplication.
Entity Resolution in Four Steps
Step 1: Build the Lookup Table Using IDs
The simplest form of resolution applies when the systems share IDs with each other. For example, Zendesk has an external ID field that maps to Salesforce.
Example Lookup Table:
Source | ID | Domain | External ID | Account ID |
Salesforce | a | oracle.com | | |
Zendesk | 1 | oracle.com | a | |
Product Usage | one | | | |
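As a rough illustration of this step, the sketch below builds the lookup rows and derives ID-based links from the Zendesk external ID field. The SourceRecord class, the build_lookup_table function, and the field names are hypothetical and chosen only for this example; they are not the actual Rox schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRecord:
    """One row of the lookup table (hypothetical shape)."""
    source: str
    id: str
    domain: Optional[str] = None
    external_id: Optional[str] = None  # assumed to hold a Salesforce ID


def build_lookup_table(records):
    """Return the lookup rows plus the links implied by shared IDs."""
    rows = [
        {"source": r.source, "id": r.id, "domain": r.domain,
         "external_id": r.external_id}
        for r in records
    ]
    salesforce_ids = {r.id for r in records if r.source == "Salesforce"}
    # A link exists only when the external ID points at a known Salesforce row.
    links = [
        ((r.source, r.id), ("Salesforce", r.external_id))
        for r in records
        if r.external_id is not None and r.external_id in salesforce_ids
    ]
    return rows, links


records = [
    SourceRecord("Salesforce", "a", domain="oracle.com"),
    SourceRecord("Zendesk", "1", domain="oracle.com", external_id="a"),
    SourceRecord("Product Usage", "one"),
]
rows, links = build_lookup_table(records)
print(links)
```

In this sketch the Product Usage row stays unlinked, because the example table shows no shared ID for it; linking it would need another signal.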
Step 2: Merge Data into the Knowledge Graph
Using the lookup table:
No Existing Graph: Create a new graph where each set of linked rows in the lookup table is assigned a unique rox_id (the universal entity identifier).
Existing Graph: Merge the lookup table with the existing graph using an outer join. Preserve new entities while dropping ones that no longer exist.
Example Graph:
rox_id | Salesforce | Zendesk | Product Usage |
uuid1 | a | 1 | one |
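One way to picture this merge is the sketch below. It groups linked records with a small union-find, then either mints a new rox_id or reuses one from an existing graph. The function names and the dictionary layout (rox_id mapped to source and source ID) are assumptions for illustration, and the link from Product Usage to Salesforce is assumed so that the output matches the example graph.

```python
import uuid


def group_linked_records(keys, links):
    """Group (source, id) keys that are connected by links (union-find).

    Every key that appears in links must also appear in keys.
    """
    parent = {k: k for k in keys}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for k in keys:
        groups.setdefault(find(k), []).append(k)
    return list(groups.values())


def merge_into_graph(groups, existing=None):
    """Return {rox_id: {source: source_id}}.

    A group reuses the rox_id of an existing entity it overlaps with;
    otherwise a new rox_id is minted. Existing entities that match no
    group are dropped. Simplification: one ID per source per entity.
    """
    existing = existing or {}
    key_to_rox = {
        (src, sid): rox
        for rox, members in existing.items()
        for src, sid in members.items()
    }
    graph = {}
    for group in groups:
        rox_id = next((key_to_rox[k] for k in group if k in key_to_rox), None)
        if rox_id is None or rox_id in graph:
            rox_id = str(uuid.uuid4())
        graph[rox_id] = dict(group)
    return graph


keys = [("Salesforce", "a"), ("Zendesk", "1"), ("Product Usage", "one")]
links = [
    (("Zendesk", "1"), ("Salesforce", "a")),
    (("Product Usage", "one"), ("Salesforce", "a")),  # assumed link
]
groups = group_linked_records(keys, links)
graph = merge_into_graph(groups)
```

Running merge_into_graph again with existing=graph keeps the same rox_id, which is the property that lets downstream consumers treat it as a stable identifier.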
Step 3: Resolve Relationships and Data Sources
Once entities are linked:
Assign a priority to relationships, ensuring data is merged in the correct order (e.g., Salesforce > Zendesk > Product Usage).
Materialize rox_id mappings for each data source, creating a unified representation.
Unified Representation:
rox_id | Data Source | Source ID |
uuid1 | Salesforce | a |
uuid1 | Zendesk | 1 |
uuid1 | Product Usage | one |
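The unified representation is essentially the graph unpivoted into one row per data source, ordered by source priority. The sketch below shows that transformation under the assumption that the graph is held as a dictionary of rox_id to source IDs; SOURCE_PRIORITY and materialize_mappings are illustrative names.

```python
# Priority order taken from the example above: Salesforce > Zendesk > Product Usage.
SOURCE_PRIORITY = ["Salesforce", "Zendesk", "Product Usage"]


def materialize_mappings(graph, priority=SOURCE_PRIORITY):
    """Flatten {rox_id: {source: source_id}} into (rox_id, source, source_id) rows."""
    rank = {source: i for i, source in enumerate(priority)}
    rows = [
        (rox_id, source, source_id)
        for rox_id, members in graph.items()
        for source, source_id in members.items()
    ]
    # Unknown sources sort after every prioritized source.
    rows.sort(key=lambda r: (r[0], rank.get(r[1], len(rank))))
    return rows


graph = {"uuid1": {"Product Usage": "one", "Zendesk": "1", "Salesforce": "a"}}
mappings = materialize_mappings(graph)
for row in mappings:
    print(" | ".join(row))
```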
Step 4: Materialize the Entities
The final step involves resolving attributes like domain, name, or email for each entity:
Use ERM relationships to map fields from individual data sources.
Resolve conflicts using rules (e.g., prioritize Salesforce over Zendesk for domains).
Store the resolved values in the knowledge graph for downstream applications.
Example Materialized Entity:
Entity Type | rox_id | Domain | Source System |
Company | uuid1 | oracle.com | Salesforce |
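A minimal sketch of the conflict rule, assuming each attribute arrives as a dictionary of candidate values keyed by source: the first non-empty value in priority order wins, and the winning source is recorded next to it. The resolve_attribute helper and the entity field names are hypothetical.

```python
def resolve_attribute(values_by_source, priority):
    """Return (value, source) from the highest-priority source with a value.

    Returns (None, None) when no source provides a non-empty value.
    """
    for source in priority:
        value = values_by_source.get(source)
        if value:
            return value, source
    return None, None


priority = ["Salesforce", "Zendesk", "Product Usage"]
domain_by_source = {"Salesforce": "oracle.com", "Zendesk": "oracle.com"}

entity = {"entity_type": "Company", "rox_id": "uuid1"}
entity["domain"], entity["source_system"] = resolve_attribute(
    domain_by_source, priority
)
print(entity)
```

If the Salesforce value is missing or empty, the same call falls through to Zendesk, so the rule degrades gracefully instead of leaving the attribute blank.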
Algorithms and Optimization
Exact Match:
Directly links IDs across systems.
Example: Zendesk external ID a maps to Salesforce ID a.
Fuzzy Match:
Uses fields like domains or emails for approximate matching.
Weighted matching (e.g., TF-IDF on domain) improves accuracy for similar values.
Priority-Based Resolution:
Orders data sources based on trustworthiness or data quality.
Apply advanced AI algorithms to assess the relevance between entity records.
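To make the fuzzy-match idea concrete, here is a small, dependency-free sketch of TF-IDF weighting over domains with cosine similarity. The text above only names TF-IDF on domain; tokenizing into character trigrams and using a smoothed IDF are assumptions made for this illustration, not a description of the production matcher.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a padded, lowercased string into overlapping character n-grams."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build one sparse TF-IDF vector (dict) per input string."""
    tokenized = [Counter(char_ngrams(d, n)) for d in docs]
    doc_freq = Counter(gram for counts in tokenized for gram in counts)
    total = len(docs)
    return [
        {
            # Smoothed IDF: grams shared by every document get weight 1.
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in counts.items()
        }
        for counts in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


domains = ["oracle.com", "oracle.co.uk", "databricks.com"]
vecs = tfidf_vectors(domains)
sim_oracle = cosine(vecs[0], vecs[1])  # oracle.com vs oracle.co.uk
sim_other = cosine(vecs[0], vecs[2])   # oracle.com vs databricks.com
print(sim_oracle > sim_other)
```

The weighting matters because fragments such as ".co" appear in almost every domain; down-weighting them keeps two unrelated ".com" domains from looking similar.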
Challenges in Entity Resolution
Low Fidelity Data:
Fields like domain or email might be incorrect or incomplete.
Example: A Salesforce entry for Databricks pointing to https://spark.apache.org.
High Cardinality:
Multiple results for a single query (e.g., ZoomInfo returning several potential matches).
Solution: Introduce a "User Feedback Required" step for ambiguous cases.
Dynamic Updates:
Ensuring real-time sync with new data sources while maintaining graph consistency.
Given the challenges above, we cannot simply persist the resolution automatically. We sometimes need a human in the loop to verify and confirm the associations; once they are confirmed, we maintain the mappings we constructed.
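The human-in-the-loop flow described above could be sketched as a triage function: confident, unambiguous matches are linked automatically, ambiguous ones are routed to a "User Feedback Required" state, and confirmed associations are stored and reused. The thresholds, the candidate IDs, and the shape of the confirmed store are all hypothetical values chosen for the example.

```python
from enum import Enum


class Decision(Enum):
    AUTO_LINK = "auto_link"
    USER_FEEDBACK_REQUIRED = "user_feedback_required"
    NO_MATCH = "no_match"


def triage(candidates, confirmed, record_key, accept=0.9, margin=0.1):
    """Decide what to do with scored match candidates for one record.

    candidates: list of (candidate_id, score) pairs.
    confirmed:  dict of record_key -> candidate_id already verified by a human.
    """
    if record_key in confirmed:
        # A human already confirmed this association; keep it.
        return Decision.AUTO_LINK, confirmed[record_key]
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not ranked:
        return Decision.NO_MATCH, None
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Auto-link only when the best match is strong and clearly ahead.
    if best_score >= accept and best_score - runner_up >= margin:
        return Decision.AUTO_LINK, best_id
    return Decision.USER_FEEDBACK_REQUIRED, None


confirmed = {}
record_key = ("Salesforce", "a")
candidates = [("zi-1", 0.92), ("zi-2", 0.90)]  # two near-identical matches

first = triage(candidates, confirmed, record_key)
confirmed[record_key] = "zi-1"  # a user picks the right candidate
second = triage(candidates, confirmed, record_key)
print(first, second)
```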
Streaming Graphs
Even though Rox operates on internet-scale data, expectations are high to see entities and their relationships as quickly as possible. Today the graph build process runs mostly in batch fashion, but with the advent of the latest streaming technologies, Rox is looking to capture changes and process deltas as soon as they are detected, and to materialize entities and relationships. We will explore this in future sections.
Conclusion
Entity resolution is the cornerstone of building a Unified Knowledge Graph. By linking, de-duplicating, and contextualizing data across diverse sources, the UKG enables seamless insights and intelligent automation. The process, though complex, ensures that organizations can leverage their data with accuracy and confidence.