Similarity and Deduplication
GraWiki exposes entity deduplication as a user-facing workflow rather than only as low-level helpers.
The recommended progression is:
- Inspect exact semantic-key collisions with GraphRAG.find_entity_collision_candidates or EntitySimilarityFinder.
- Run the broader duplicate scan with GraphRAG.find_entity_duplicate_candidates.
- Execute merges with GraphRAG.dedupe_entities when the candidates are acceptable.
The duplicate workflow has two stages:
- Exact collision detection by semantic_key.
- Broader matcher-based scanning using vector or fuzzy similarity.
Supporting APIs you will usually need:
- EntitySimilarityFinder
- EntitySimilarityMatcher
- VectorEntitySimilarityMatcher
- RapidFuzzEntitySimilarityMatcher
- EntityDuplicateCandidates
- SemanticKeyCollisionCandidates
- MergeReport
The same similarity stack also powers ingest-time entity resolution through GraphRAG when resolve_entities_on_ingest is enabled.
For a task-oriented walkthrough, see How to deduplicate entities.
grawiki.similarity
Entity similarity search and collision inspection.
EntitySimilarityMatcher
Bases: Protocol
Protocol for entity-to-entity similarity matcher implementations.
search
async
search(*, entity, limit=10, threshold=None, candidates=None)
Return ranked entity candidates for entity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| entity | Node | Source entity used as the similarity query. | required |
| limit | int | Maximum number of candidate hits to return. | 10 |
| threshold | float \| None | Optional matcher-specific minimum score. | None |
| candidates | Sequence[Node] \| None | Optional pre-filtered candidate pool. | None |
Returns:
| Type | Description |
|---|---|
| list[NodeHit] | Ranked candidate hits for the source entity. |
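Because the matcher is a Protocol, any object with a compatible async search method should be usable in its place. The sketch below illustrates that shape only. It does not import grawiki: Node and NodeHit here are hypothetical stand-ins with assumed fields (id, name, node, score), and NameRatioMatcher is an invented toy matcher that scores names with the standard library's difflib rather than with either of the matchers documented on this page.

```python
import asyncio
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Protocol, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for the library's Node (fields are assumed)."""
    id: str
    name: str


@dataclass(frozen=True)
class NodeHit:
    """Hypothetical stand-in for the library's NodeHit (fields are assumed)."""
    node: Node
    score: float


class EntitySimilarityMatcher(Protocol):
    """Local copy of the documented search signature, for illustration."""

    async def search(
        self,
        *,
        entity: Node,
        limit: int = 10,
        threshold: Optional[float] = None,
        candidates: Optional[Sequence[Node]] = None,
    ) -> list: ...


class NameRatioMatcher:
    """Toy matcher: difflib name similarity, scores in [0, 1]."""

    def __init__(self, default_threshold: float = 0.6) -> None:
        self.default_threshold = default_threshold

    async def search(self, *, entity, limit=10, threshold=None, candidates=None):
        # Fall back to the matcher's own default when no threshold is given.
        cutoff = self.default_threshold if threshold is None else threshold
        hits = []
        for candidate in candidates or ():
            if candidate.id == entity.id:
                continue  # never match an entity with itself
            score = SequenceMatcher(
                None, entity.name.lower(), candidate.name.lower()
            ).ratio()
            if score >= cutoff:
                hits.append(NodeHit(candidate, score))
        # Highest score first, truncated to the requested limit.
        hits.sort(key=lambda hit: hit.score, reverse=True)
        return hits[:limit]


pool = [Node("1", "Acme Corp"), Node("2", "Acme Corporation"), Node("3", "Globex")]
matcher: EntitySimilarityMatcher = NameRatioMatcher()
hits = asyncio.run(matcher.search(entity=pool[0], candidates=pool))
for hit in hits:
    print(hit.node.name, round(hit.score, 2))
```

The threshold is deliberately left matcher-specific: this toy scores in [0, 1], so a cutoff chosen for it would not carry over to a matcher that scores on a different scale.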
RapidFuzzEntitySimilarityMatcher
Match similar entity names using RapidFuzz.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| db | GraphDB | Graph database adapter used to enumerate persisted entities. | required |
| default_threshold | float | Minimum similarity score required when | 90.0 |
search
async
search(*, entity, limit=10, threshold=None, candidates=None)
Return RapidFuzz similarity hits for one entity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| entity | Node | Source entity used as the similarity query. | required |
| limit | int | Maximum number of candidate hits to return. | 10 |
| threshold | float \| None | Minimum score required to keep a candidate. Defaults to :attr: | None |
| candidates | Sequence[Node] \| None | Optional candidate pool. When omitted, all persisted entities are loaded from the database. | None |
Returns:
| Type | Description |
|---|---|
| list[NodeHit] | Ranked candidate hits. |
Notes
Candidate scores use rapidfuzz.fuzz.WRatio and therefore fall in the range [0, 100].
EntityDuplicateCandidates
dataclass
Two-stage duplicate-candidate report for persisted entities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| semantic_key_collisions | dict[str, list[Node]] | Exact collision groups keyed by semantic key. | required |
| semantic_key_collision_candidates | list[SemanticKeyCollisionCandidates] | Matcher-ranked candidates restricted to exact semantic-key collision groups. | required |
| similarity_candidates | list[EntitySimilarityResult] | Matcher-ranked candidates found by the broader similarity scan. | required |
EntitySimilarityResult
dataclass
SemanticKeyCollisionCandidates
dataclass
Similarity candidates generated for a duplicated semantic key group.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| semantic_key | str | Semantic key shared by more than one persisted entity. | required |
| results | list[EntitySimilarityResult] | Per-entity similarity results restricted to the collision group. | required |
EntitySimilarityFinder
Inspect entity collisions and search for merge candidates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| db | GraphDB | Graph database adapter used to load persisted entity nodes. | required |
| matcher | EntitySimilarityMatcher \| None | Similarity matcher implementation used to produce candidate matches. Defaults to :class: | None |
find_semantic_key_collisions
async
find_semantic_key_collisions(*, include_embeddings=False)
Return entity groups whose semantic key occurs more than once.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| include_embeddings | bool | Whether the returned entities should include embeddings. | False |
Returns:
| Type | Description |
|---|---|
| dict[str, list[Node]] | Entity groups keyed by semantic key, including only keys with more than one entity. |
Notes
This method is intended as a lightweight integrity check before running more expensive similarity matching.
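The grouping this method describes can be pictured as a plain group-by over semantic keys that keeps only keys with more than one member. The sketch below is an assumption about the shape of the result, not the library's implementation: Entity is a hypothetical stand-in, and the key format "org:acme" is invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for a persisted entity node."""
    id: str
    semantic_key: str


def group_semantic_key_collisions(entities):
    """Group entities by semantic key and keep only colliding keys."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity.semantic_key].append(entity)
    # A key is a collision only when more than one entity shares it.
    return {key: members for key, members in groups.items() if len(members) > 1}


entities = [
    Entity("1", "org:acme"),
    Entity("2", "org:acme"),
    Entity("3", "org:globex"),
]
collisions = group_semantic_key_collisions(entities)
for key, members in collisions.items():
    print(key, [member.id for member in members])
```

No similarity scoring happens at this stage, which is why it is cheap enough to run as a first check.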
search
async
search(entity, *, limit=10, threshold=None, candidates=None)
Return similarity candidates for one entity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| entity | Node | Source entity used as the similarity query. | required |
| limit | int | Maximum number of candidate hits to return. | 10 |
| threshold | float \| None | Optional strategy-specific minimum score. | None |
| candidates | list[Node] \| None | Optional pre-filtered candidate pool. | None |
Returns:
| Type | Description |
|---|---|
| list[NodeHit] | Ranked candidate hits. |
find_collision_candidates
async
find_collision_candidates(*, collisions=None, limit=10, threshold=None)
Run similarity search inside semantic-key collision groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| collisions | dict[str, list[Node]] \| None | Precomputed semantic-key collision groups. When omitted, collisions are loaded from the database. | None |
| limit | int | Maximum number of candidate hits returned per source entity. | 10 |
| threshold | float \| None | Optional strategy-specific minimum score. | None |
Returns:
| Type | Description |
|---|---|
| list[SemanticKeyCollisionCandidates] | Collision groups with per-entity candidate matches. |
Notes
Similarity matching is restricted to members of each collision group so that the output is safe to use as a merge-candidate inspection aid.
find_similarity_candidates
async
find_similarity_candidates(*, limit=10, threshold=None, entities=None, skip_semantic_key_collisions=False, semantic_key_collisions=None)
Run a broader matcher-based duplicate scan across entities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| limit | int | Maximum number of candidate hits returned per source entity. | 10 |
| threshold | float \| None | Optional matcher-specific minimum score. | None |
| entities | list[Node] \| None | Optional explicit entity pool. When omitted, persisted entities are loaded from the database with embeddings included. | None |
| skip_semantic_key_collisions | bool | Whether entities already participating in an exact semantic-key collision should be excluded from this broader similarity scan. | False |
| semantic_key_collisions | dict[str, list[Node]] \| None | Precomputed semantic-key collision groups used when | None |
Returns:
| Type | Description |
|---|---|
| list[EntitySimilarityResult] | Ranked similarity candidates grouped by source entity. |
Notes
Each entity pair is considered at most once by searching only against entities that appear later in the candidate order.
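The pair-at-most-once rule in the note above can be sketched as follows: each source is compared only with the items that come after it in the list, so an unordered pair is never scored twice. This is an illustration under assumptions, not the library's code. It uses plain name strings instead of Node objects and difflib's ratio as a stand-in score. The helper names similarity and scan_pairs_once are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in score in [0, 1]; not the score a real matcher would use."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def scan_pairs_once(names, threshold):
    """Return (source, hits) tuples, scoring each unordered pair once."""
    results = []
    for index, source in enumerate(names):
        hits = []
        # Only look at later items: earlier ones already compared against us.
        for other in names[index + 1:]:
            score = similarity(source, other)
            if score >= threshold:
                hits.append((other, score))
        if hits:
            hits.sort(key=lambda hit: hit[1], reverse=True)
            results.append((source, hits))
    return results


names = ["Acme Corp", "Globex", "Acme Corporation", "ACME Corp"]
results = scan_pairs_once(names, threshold=0.7)
for source, hits in results:
    print(source, [(other, round(score, 2)) for other, score in hits])
```

One consequence of this ordering is that a match is reported under whichever member of the pair appears first, so the last entity in the pool never appears as a source.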
find_duplicate_candidates
async
find_duplicate_candidates(*, limit=10, threshold=None, skip_semantic_key_collisions_in_similarity_scan=True)
Run the two-step duplicate-finding heuristic across entities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| limit | int | Maximum number of candidate hits returned per source entity. | 10 |
| threshold | float \| None | Optional matcher-specific minimum score. | None |
| skip_semantic_key_collisions_in_similarity_scan | bool | Whether the broader similarity scan should exclude entities already involved in exact semantic-key collisions. | True |
Returns:
| Type | Description |
|---|---|
| EntityDuplicateCandidates | Combined report containing exact collisions, verified collision-group candidates, and broader matcher-based similarity candidates. |
VectorEntitySimilarityMatcher
Match similar entities using cosine similarity over embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| db | GraphDB | Graph database adapter used to enumerate persisted entities. | required |
| default_threshold | float | Minimum cosine similarity required when | 0.8 |
search
async
search(*, entity, limit=10, threshold=None, candidates=None)
Return vector similarity hits for one entity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| entity | Node | Source entity used as the similarity query. | required |
| limit | int | Maximum number of candidate hits to return. | 10 |
| threshold | float \| None | Minimum cosine similarity required to keep a candidate. Defaults to :attr: | None |
| candidates | Sequence[Node] \| None | Optional candidate pool. When omitted, all persisted entities with embeddings are loaded from the database. | None |
Returns:
| Type | Description |
|---|---|
| list[NodeHit] | Ranked candidate hits. |
Notes
Candidate scores use cosine similarity and therefore usually fall in
the range [-1, 1].
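To make the score range and the 0.8 default concrete, here is a minimal pure-Python sketch of cosine similarity with a threshold filter. It is not the matcher's implementation: the two-dimensional embeddings and their labels are invented, and returning 0.0 for a zero-length vector is an assumption made here to avoid dividing by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
embeddings = {
    "same": [2.0, 0.0],       # same direction, different length
    "near": [4.0, 3.0],       # small angle to the query
    "far": [3.0, 4.0],        # larger angle to the query
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}

scores = {name: cosine_similarity(query, vec) for name, vec in embeddings.items()}

threshold = 0.8  # mirrors the documented default_threshold
kept = sorted(
    (name for name, score in scores.items() if score >= threshold),
    key=lambda name: scores[name],
    reverse=True,
)
print(kept)
```

Vector length does not affect the score, only direction does, which is why "same" scores as a perfect match despite being twice as long as the query.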