Deduplicate entities

This guide shows the recommended order for reviewing and merging duplicate entities: inspect the duplicate candidates, run a dry merge and review its reports, then apply merges only after you are satisfied with the candidate groups.

Inspect duplicate candidates

Run find_entity_duplicate_candidates to get a combined report that covers both exact semantic-key collisions and broader similarity-based candidates in one call.

duplicate_candidates = await rag.find_entity_duplicate_candidates(
    limit=5,
    threshold=0.9,
)

The returned EntityDuplicateCandidates object contains:

semantic_key_collision_candidates — entities that already share the same semantic_key.
similarity_candidates — additional matches found by the configured similarity matcher.

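Review semantic_key_collision_candidates first. Because those entities already share an identical semantic_key, they are the safest merges; similarity_candidates come from the similarity matcher and need closer inspection.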
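One way to keep the collision-first order explicit in a review script is sketched below. It uses a stand-in dataclass rather than the real EntityDuplicateCandidates type: only the two attribute names come from this guide, and the shape of each candidate group (a plain list of entity ids here) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CandidatesStandIn:
    """Hypothetical stand-in for EntityDuplicateCandidates.

    Only the two attribute names are taken from this guide.
    """
    semantic_key_collision_candidates: list = field(default_factory=list)
    similarity_candidates: list = field(default_factory=list)


def review_queue(candidates):
    """Return (source, group) pairs, collision groups before similarity groups."""
    queue = [("collision", g) for g in candidates.semantic_key_collision_candidates]
    queue += [("similarity", g) for g in candidates.similarity_candidates]
    return queue


# Assumed group shape: a list of entity ids. The real objects may differ.
example = CandidatesStandIn(
    semantic_key_collision_candidates=[["e1", "e2"]],
    similarity_candidates=[["e3", "e4"], ["e5", "e6"]],
)

for source, group in review_queue(example):
    print(source, group)
```

With a real report, the same helper should work as long as both attributes are iterable; adapt the printing to whatever fields the candidate groups actually expose.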

Run a dry merge first

Call dedupe_entities with dry_run=True to preview the proposed merges before applying any destructive changes.

dry_run_reports = await rag.dedupe_entities(
    limit=5,
    threshold=0.9,
    min_merge_score=0.95,
    dry_run=True,
)

Review the returned MergeReport objects before proceeding. In particular, check the chosen master node, duplicate ids, merged labels, and property conflicts.
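The checks above can be turned into a small review helper. The sketch below is illustrative only: MergeReportStandIn and its field names (master_id, duplicate_ids, merged_labels, property_conflicts) are hypothetical and chosen to mirror the items listed in this guide, so map them to the actual MergeReport attributes before using it.

```python
from dataclasses import dataclass, field


@dataclass
class MergeReportStandIn:
    """Hypothetical stand-in; the real MergeReport field names may differ."""
    master_id: str
    duplicate_ids: list
    merged_labels: list
    property_conflicts: dict = field(default_factory=dict)


def needs_manual_review(report):
    """Flag any report that carries at least one property conflict."""
    return bool(report.property_conflicts)


def summarize(report):
    """Build a one-line summary of the fields worth checking before applying."""
    flag = "REVIEW" if needs_manual_review(report) else "ok"
    return (
        f"[{flag}] master={report.master_id} "
        f"duplicates={report.duplicate_ids} "
        f"labels={report.merged_labels} "
        f"conflicts={sorted(report.property_conflicts)}"
    )


# Made-up sample data, used only to exercise the helper.
sample_reports = [
    MergeReportStandIn("e1", ["e2"], ["Person"]),
    MergeReportStandIn("e3", ["e4"], ["Company"], {"name": ["Acme", "ACME Inc."]}),
]

for report in sample_reports:
    print(summarize(report))
```

Flagging only on property conflicts is a deliberately narrow rule; you may also want to flag unexpected master choices or unusually large duplicate groups.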

Apply merges

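Apply merges only after reviewing the dry-run output. Keep limit, threshold, and min_merge_score the same as in the dry run so that the call covers the merges you reviewed.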

applied_reports = await rag.dedupe_entities(
    limit=5,
    threshold=0.9,
    min_merge_score=0.95,
    dry_run=False,
)

This step updates the graph and removes duplicate entity nodes.

Safety notes

Prefer reviewing collision groups before broader similarity groups.
Use conservative thresholds until you understand the shape of your data.
Treat dry_run=False as a destructive operation and keep it out of exploratory notebooks until the candidate sets look correct.
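To make the review step hard to skip, the dry run and the real run can be wrapped in one helper that applies merges only after an explicit approval. This is a minimal sketch, not part of the library: dedupe_with_review and the approve callback are hypothetical names, and FakeRag is a stub that only mimics the dedupe_entities keyword arguments shown in this guide.

```python
import asyncio


async def dedupe_with_review(rag, approve, **params):
    """Run a dry run, hand the reports to `approve`, apply only if it returns True."""
    dry_reports = await rag.dedupe_entities(dry_run=True, **params)
    if not approve(dry_reports):
        return dry_reports, None  # nothing was applied
    applied_reports = await rag.dedupe_entities(dry_run=False, **params)
    return dry_reports, applied_reports


class FakeRag:
    """Stub used only to exercise the helper; not the real client."""

    def __init__(self):
        self.calls = []

    async def dedupe_entities(self, *, limit, threshold, min_merge_score, dry_run):
        # Record whether each call was a dry run so the order can be inspected.
        self.calls.append(dry_run)
        return [f"report(dry_run={dry_run})"]


fake = FakeRag()
params = dict(limit=5, threshold=0.9, min_merge_score=0.95)

# Rejected review: only the dry run is executed.
_, rejected = asyncio.run(dedupe_with_review(fake, lambda reports: False, **params))

# Approved review: a dry run followed by the real merge with the same parameters.
_, applied = asyncio.run(dedupe_with_review(fake, lambda reports: True, **params))
```

Passing the same params to both calls keeps the applied merge aligned with what was reviewed. In a notebook, approve could be a function that prints the reports and asks for confirmation.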

For the lower-level types involved in this workflow, see Similarity and Deduplication.