Entity Graph Builder

Introduction

The foundation of any effective Unified Knowledge Graph (UKG) lies in its ability to resolve entities across multiple data sources. Entity resolution ensures that data from CRM systems, ticketing platforms, product logs, and public data sources like ZoomInfo are accurately linked, de-duplicated, and contextualized.

This blog explores the step-by-step process of resolving entities and constructing the knowledge graph, highlighting algorithms, data flow, and real-world challenges.


The Challenge: Fragmented Data, Shared Identities

Entity resolution tackles the problem of fragmented identities across data sources. For example:

  • A Salesforce account for "Oracle" might have an ID of a and a domain of oracle.com.

  • The same organization in Zendesk might have an external ID of 1, also linked to oracle.com.

  • Product logs might refer to the same entity by a completely different mechanism, such as an internal ID of one.

Without entity resolution, these would appear as separate entities, leading to inconsistency and duplication.


Entity Resolution in Four Steps

Step 1: Build the Lookup Table Using IDs

The simplest form of resolution applies when the systems share IDs with one another. For example, Zendesk has an external ID field that maps to a Salesforce ID.

Example Lookup Table:

Source          ID    Domain       External ID   Account ID
Salesforce      a     oracle.com
Zendesk         1     oracle.com   a
Product Usage   one
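As a rough illustration of this step, the sketch below groups records that share an ID into lookup rows. The SourceRecord class, its field names, and build_lookup_rows are hypothetical names chosen for this example, not part of any Rox API. Only the Salesforce and Zendesk records are used, since the post does not say which field links the Product Usage record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRecord:
    """One record from a source system (hypothetical shape)."""
    source: str
    source_id: str
    domain: Optional[str] = None
    external_id: Optional[str] = None


def build_lookup_rows(records):
    """Group records that point at the same Salesforce ID.

    A Salesforce record anchors its own group; any other record joins the
    group named by its external_id. Records without a shared ID stay alone.
    """
    groups = {}
    for rec in records:
        if rec.source == "Salesforce":
            key = ("Salesforce", rec.source_id)
        elif rec.external_id is not None:
            key = ("Salesforce", rec.external_id)
        else:
            key = (rec.source, rec.source_id)  # no shared ID to link on
        groups.setdefault(key, {})[rec.source] = rec.source_id
    return list(groups.values())


records = [
    SourceRecord("Salesforce", "a", domain="oracle.com"),
    SourceRecord("Zendesk", "1", domain="oracle.com", external_id="a"),
]
lookup_rows = build_lookup_rows(records)
```

Each returned row maps a source name to that source's ID for one entity, which is the shape the merge step can consume.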


Step 2: Merge Data into the Knowledge Graph

Using the lookup table:

  • No Existing Graph: Create a new graph where each row in the lookup table is assigned a unique rox_id (the universal entity identifier).

  • Existing Graph: Merge the lookup table with the existing graph using an outer join, adding entities that are new and dropping those that no longer exist.

Example Graph:

rox_id   Salesforce   Zendesk   Product Usage
uuid1    a            1         one
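One possible reading of the merge, sketched with plain dictionaries: rows that match a known (source, ID) pair keep their rox_id, unmatched rows get a fresh one, and entities missing from the new lookup table are dropped. The function name merge_into_graph and the dictionary layout are assumptions made for this example, and the sketch does not handle a row that matches two different existing entities.

```python
import uuid


def merge_into_graph(existing_graph, lookup_rows):
    """Merge lookup rows into an existing graph.

    existing_graph: dict mapping rox_id -> {source_name: source_id}
    lookup_rows:    list of {source_name: source_id} rows
    Returns a new graph with the same shape as existing_graph.
    """
    # Index known (source, id) pairs so a row can find its rox_id.
    owner = {
        (source, source_id): rox_id
        for rox_id, row in existing_graph.items()
        for source, source_id in row.items()
    }
    merged = {}
    for row in lookup_rows:
        # Reuse the rox_id of the first already-known pair in this row.
        rox_id = next(
            (owner[pair] for pair in row.items() if pair in owner),
            None,
        )
        if rox_id is None:
            rox_id = str(uuid.uuid4())  # brand-new entity
        merged.setdefault(rox_id, {}).update(row)
    # Entities absent from lookup_rows are not copied over, so they drop out.
    return merged


# No existing graph: every row receives a new rox_id.
graph = merge_into_graph(
    {}, [{"Salesforce": "a", "Zendesk": "1", "Product Usage": "one"}]
)
```

Calling the function again with an updated lookup table keeps the rox_id stable for entities that are still present.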


Step 3: Resolve Relationships and Data Sources

Once entities are linked:

  • Assign a priority to relationships, ensuring data is merged in the correct order (e.g., Salesforce > Zendesk > Product Usage).

  • Materialize rox_id mappings for each data source, creating a unified representation.

Unified Representation:

rox_id   Data Source     Source ID
uuid1    Salesforce      a
uuid1    Zendesk         1
uuid1    Product Usage   one
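The unified representation above can be produced by flattening the graph into one row per source, ordered by priority. This is a minimal sketch: materialize_mappings and SOURCE_PRIORITY are hypothetical names, and the priority list simply mirrors the example ordering given in this step.

```python
# Example ordering from this step: Salesforce > Zendesk > Product Usage.
SOURCE_PRIORITY = ["Salesforce", "Zendesk", "Product Usage"]


def materialize_mappings(graph, priority=SOURCE_PRIORITY):
    """Flatten {rox_id: {source: source_id}} into (rox_id, source, source_id) rows.

    Rows for each entity are emitted in priority order so that later merges
    can process the most trusted source first. Unknown sources sort last.
    """
    rank = {source: i for i, source in enumerate(priority)}
    rows = []
    for rox_id, sources in graph.items():
        ordered = sorted(sources, key=lambda s: rank.get(s, len(rank)))
        rows.extend((rox_id, s, sources[s]) for s in ordered)
    return rows


graph = {"uuid1": {"Zendesk": "1", "Product Usage": "one", "Salesforce": "a"}}
rows = materialize_mappings(graph)
```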


Step 4: Materialize the Entities

The final step involves resolving attributes like domain, name, or email for each entity:

  1. Use ERM relationships to map fields from individual data sources.

  2. Resolve conflicts using rules (e.g., prioritize Salesforce over Zendesk for domains).

  3. Store the resolved values in the knowledge graph for downstream applications.

Example Materialized Entity:

Entity Type   rox_id   Domain       Source System
Company       uuid1    oracle.com   Salesforce
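A priority rule like the one in item 2 can be sketched as a simple first-match lookup. The function resolve_attribute and the input layout are assumptions made for illustration; real conflict rules may be more involved than "first non-empty value wins".

```python
def resolve_attribute(field, values_by_source, priority):
    """Pick the value from the highest-priority source that has one.

    Returns (value, source), or (None, None) when no source supplies the field.
    """
    for source in priority:
        value = values_by_source.get(source, {}).get(field)
        if value:  # skip missing or empty values
            return value, source
    return None, None


values_by_source = {
    "Salesforce": {"domain": "oracle.com"},
    "Zendesk": {"domain": "oracle.com"},
    "Product Usage": {},
}
priority = ["Salesforce", "Zendesk", "Product Usage"]

domain, source = resolve_attribute("domain", values_by_source, priority)
entity = {
    "entity_type": "Company",
    "rox_id": "uuid1",
    "domain": domain,
    "source_system": source,
}
```

Recording the winning source next to the value keeps the resolved attribute traceable, as in the Source System column above.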


Algorithms and Optimization

  1. Exact Match:

    • Directly links IDs across systems.

    • Example: Zendesk external ID a maps to Salesforce ID a.

  2. Fuzzy Match:

    • Uses fields like domains or emails for approximate matching.

    • Weighted matching (e.g., TF-IDF on domain) helps keep matches accurate when values are similar but not identical.

  3. Priority-Based Resolution:

    • Orders data sources based on trustworthiness or data quality.

    • Applies advanced AI algorithms to assess the relevance between entity records.
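To make the fuzzy-match idea concrete, here is a small sketch of TF-IDF weighting over character trigrams of a domain, compared with cosine similarity. It uses only the standard library. The trigram tokenization, the smoothed IDF formula, and the sample domains are choices made for this example and are not taken from the Rox implementation.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of a lowercased string."""
    text = text.lower()
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Return one {ngram: weight} dict per document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Number of documents each n-gram appears in.
    doc_freq = Counter(g for c in counts for g in c)
    total = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF: common n-grams (like ".co") get a low weight.
        vectors.append({
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
            for g, tf in c.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


domains = ["oracle.com", "oracle.co.uk", "databricks.com"]
vecs = tfidf_vectors(domains)
sim_close = cosine(vecs[0], vecs[1])  # oracle.com vs oracle.co.uk
sim_far = cosine(vecs[0], vecs[2])    # oracle.com vs databricks.com
```

Because shared suffixes such as ".com" appear in many domains, they are weighted down, so the two Oracle domains score closer to each other than to the unrelated one.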


Challenges in Entity Resolution

  1. Low Fidelity Data:

    • Fields like domain or email might be incorrect or incomplete.

    • Example: A Salesforce entry for Databricks pointing to https://spark.apache.org.

  2. High Cardinality:

    • Multiple results for a single query (e.g., ZoomInfo returning several potential matches).

    • Solution: Introduce a "User Feedback Required" step for ambiguous cases.

  3. Dynamic Updates:

    • Ensuring real-time sync with new data sources while maintaining graph consistency.

Given the challenges above, we cannot simply persist every resolution automatically. Some associations need a human in the loop to verify and confirm them. Once an association is confirmed, we maintain the mapping we constructed.
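A "User Feedback Required" step could look roughly like the sketch below: a single strong candidate is linked automatically, anything ambiguous is routed to a person, and confirmed associations are kept. The function names, status strings, and the 0.9 threshold are all hypothetical values chosen for illustration.

```python
def triage_candidates(candidates, auto_threshold=0.9):
    """Decide whether a match can be linked automatically.

    candidates: list of (candidate_id, score) pairs, e.g. from an
    enrichment lookup. Returns (status, candidate_id_or_None).
    """
    strong = [c for c in candidates if c[1] >= auto_threshold]
    if len(strong) == 1:
        return "auto_linked", strong[0][0]
    if not candidates:
        return "no_match", None
    # Zero or several strong candidates: ask a person instead of guessing.
    return "user_feedback_required", None


# (source, source_id) -> candidate_id, kept once a person confirms it.
confirmed_mappings = {}


def confirm(source, source_id, candidate_id):
    """Record a human-confirmed association so it is not re-resolved."""
    confirmed_mappings[(source, source_id)] = candidate_id


status, match = triage_candidates([("z1", 0.95), ("z2", 0.93)])
if status == "user_feedback_required":
    # In a real system a reviewer would choose; here we simulate picking z1.
    confirm("Salesforce", "a", "z1")
```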


Streaming Graphs

Rox operates on internet-scale data, yet users expect to see entities and their relationships as quickly as possible. Today the graph build process runs mostly in batch fashion. With the advent of modern streaming technologies, Rox is looking to capture changes and process deltas as soon as they are detected, materializing entities and relationships as it goes. We will explore this in future sections.

Conclusion

Entity resolution is the cornerstone of building a Unified Knowledge Graph. By linking, de-duplicating, and contextualizing data across diverse sources, the UKG enables seamless insights and intelligent automation. The process, though complex, ensures that organizations can leverage their data with accuracy and confidence.

Copyright © 2024 RoxAI. All rights reserved. 251 Rhode Island St, Suite 207,
San Francisco, CA 94103