Similarity and Deduplication

GraWiki exposes entity deduplication as a user-facing workflow rather than only as low-level helpers.

The recommended progression is:

  1. Inspect exact semantic-key collisions with GraphRAG.find_entity_collision_candidates or EntitySimilarityFinder.
  2. Run the broader duplicate scan with GraphRAG.find_entity_duplicate_candidates.
  3. Execute merges with GraphRAG.dedupe_entities once you have reviewed the candidates and find them acceptable.

The duplicate workflow has two stages:

  • Exact collision detection by semantic_key.
  • Broader matcher-based scanning using vector or fuzzy similarity.

Supporting APIs you will usually need:

The same similarity stack also powers ingest-time entity resolution through GraphRAG when resolve_entities_on_ingest is enabled.

For a task-oriented walkthrough, see How to deduplicate entities.

grawiki.similarity

Entity similarity search and collision inspection.

EntitySimilarityMatcher

Bases: Protocol

Protocol for entity-to-entity similarity matcher implementations.

search async

search(*, entity, limit=10, threshold=None, candidates=None)

Return ranked entity candidates for entity.

Parameters:

  • entity (Node, required): Source entity used as the similarity query.
  • limit (int, default 10): Maximum number of candidate hits to return.
  • threshold (float | None, default None): Optional matcher-specific minimum score.
  • candidates (Sequence[Node] | None, default None): Optional pre-filtered candidate pool.

Returns:

  • list[NodeHit]: Ranked candidate hits for the source entity.
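
Because EntitySimilarityMatcher is a Protocol, any class with a matching async search method should satisfy it structurally. The sketch below shows one possible custom matcher. The Node and NodeHit dataclasses are hypothetical stand-ins (their fields are assumptions, not the library's definitions), and the token-overlap scoring is purely illustrative.

```python
from __future__ import annotations

import asyncio
from collections.abc import Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for the library's Node type."""
    id: str
    name: str


@dataclass(frozen=True)
class NodeHit:
    """Hypothetical stand-in for the library's NodeHit type."""
    node: Node
    score: float


class TokenOverlapMatcher:
    """Toy matcher: Jaccard overlap of lowercase name tokens, in [0, 1]."""

    def __init__(self, default_threshold: float = 0.5) -> None:
        self.default_threshold = default_threshold

    async def search(
        self,
        *,
        entity: Node,
        limit: int = 10,
        threshold: float | None = None,
        candidates: Sequence[Node] | None = None,
    ) -> list[NodeHit]:
        minimum = self.default_threshold if threshold is None else threshold
        source_tokens = set(entity.name.lower().split())
        hits = []
        # An omitted pool yields no hits in this sketch; the documented
        # matchers load persisted entities from the database instead.
        for candidate in candidates or ():
            if candidate.id == entity.id:
                continue  # never report the source entity as its own match
            tokens = set(candidate.name.lower().split())
            union = source_tokens | tokens
            score = len(source_tokens & tokens) / len(union) if union else 0.0
            if score >= minimum:
                hits.append(NodeHit(node=candidate, score=score))
        hits.sort(key=lambda hit: hit.score, reverse=True)
        return hits[:limit]


pool = [
    Node("1", "Ada Lovelace"),
    Node("2", "Ada King Lovelace"),
    Node("3", "Alan Turing"),
]
hits = asyncio.run(TokenOverlapMatcher().search(entity=pool[0], candidates=pool))
```

All arguments are keyword-only, mirroring the documented signature, so a call site written against the protocol does not depend on argument order.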

RapidFuzzEntitySimilarityMatcher

Match similar entity names using RapidFuzz.

Parameters:

  • db (GraphDB, required): Graph database adapter used to enumerate persisted entities.
  • default_threshold (float, default 90.0): Minimum similarity score required when threshold is not provided to search.

search async

search(*, entity, limit=10, threshold=None, candidates=None)

Return RapidFuzz similarity hits for one entity.

Parameters:

  • entity (Node, required): Source entity used as the similarity query.
  • limit (int, default 10): Maximum number of candidate hits to return.
  • threshold (float | None, default None): Minimum score required to keep a candidate. Defaults to default_threshold.
  • candidates (Sequence[Node] | None, default None): Optional candidate pool. When omitted, all persisted entities are loaded from the database.

Returns:

  • list[NodeHit]: Ranked candidate hits.

Notes

Candidate scores use rapidfuzz.fuzz.WRatio and therefore fall in the range [0, 100].
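
To illustrate what a threshold of 90.0 means on a 0 to 100 scale, here is a minimal sketch of threshold filtering over name scores. It deliberately swaps in the standard library's difflib.SequenceMatcher (scaled by 100) in place of rapidfuzz.fuzz.WRatio so that it runs without extra dependencies. The two scorers produce different numbers, so treat the scores as illustrative only. The helper names are hypothetical.

```python
from difflib import SequenceMatcher


def name_score(left: str, right: str) -> float:
    """Case-insensitive similarity on a 0 to 100 scale (difflib stand-in, not WRatio)."""
    return 100.0 * SequenceMatcher(None, left.lower(), right.lower()).ratio()


def filter_by_threshold(query, names, threshold=90.0, limit=10):
    """Keep names scoring at or above the threshold, best first, capped at limit."""
    scored = [(name, name_score(query, name)) for name in names]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


matches = filter_by_threshold(
    "Acme Corporation",
    ["ACME Corporation", "Acme Corporatoin", "Globex Corporation"],
)
```

With this stand-in scorer, the case variant and the transposition typo clear the 90.0 threshold while the unrelated company does not. A threshold tuned for one scorer does not transfer to another, which is why each matcher documents its own score range.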

EntityDuplicateCandidates dataclass

Two-stage duplicate-candidate report for persisted entities.

Parameters:

  • semantic_key_collisions (dict[str, list[Node]], required): Exact collision groups keyed by semantic key.
  • semantic_key_collision_candidates (list[SemanticKeyCollisionCandidates], required): Matcher-ranked candidates restricted to exact semantic-key collision groups.
  • similarity_candidates (list[EntitySimilarityResult], required): Matcher-ranked candidates found by the broader similarity scan.

EntitySimilarityResult dataclass

Similarity result set for one source entity.

Parameters:

  • source (Node, required): Entity node that was used as the similarity query.
  • hits (list[NodeHit], required): Ranked candidate matches for the source entity.

SemanticKeyCollisionCandidates dataclass

Similarity candidates generated for a duplicated semantic key group.

Parameters:

  • semantic_key (str, required): Semantic key shared by more than one persisted entity.
  • results (list[EntitySimilarityResult], required): Per-entity similarity results restricted to the collision group.

EntitySimilarityFinder

Inspect entity collisions and search for merge candidates.

Parameters:

  • db (GraphDB, required): Graph database adapter used to load persisted entity nodes.
  • matcher (EntitySimilarityMatcher | None, default None): Similarity matcher implementation used to produce candidate matches. Defaults to grawiki.similarity.vector.VectorEntitySimilarityMatcher.

find_semantic_key_collisions async

find_semantic_key_collisions(*, include_embeddings=False)

Return entity groups whose semantic key occurs more than once.

Parameters:

  • include_embeddings (bool, default False): Whether the returned entities should include embeddings.

Returns:

  • dict[str, list[Node]]: Entity groups keyed by semantic key, including only keys with more than one entity.

Notes

This method is intended as a lightweight integrity check before running more expensive similarity matching.
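
The shape of the return value can be sketched as a plain group-by that drops singleton groups. This is a minimal illustration of the idea, not the library's implementation. The Entity dataclass and the key format are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for a persisted entity node."""
    id: str
    semantic_key: str


def group_collisions(entities):
    """Group entities by semantic key and keep only keys shared by several entities."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity.semantic_key].append(entity)
    return {key: members for key, members in groups.items() if len(members) > 1}


entities = [
    Entity("1", "person:ada-lovelace"),
    Entity("2", "person:ada-lovelace"),
    Entity("3", "person:alan-turing"),
]
collisions = group_collisions(entities)
```

Grouping needs only the keys, which is consistent with include_embeddings defaulting to False: the check can run without loading vectors.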

search async

search(entity, *, limit=10, threshold=None, candidates=None)

Return similarity candidates for one entity.

Parameters:

  • entity (Node, required): Source entity used as the similarity query.
  • limit (int, default 10): Maximum number of candidate hits to return.
  • threshold (float | None, default None): Optional strategy-specific minimum score.
  • candidates (list[Node] | None, default None): Optional pre-filtered candidate pool.

Returns:

  • list[NodeHit]: Ranked candidate hits.

find_collision_candidates async

find_collision_candidates(*, collisions=None, limit=10, threshold=None)

Run similarity search inside semantic-key collision groups.

Parameters:

  • collisions (dict[str, list[Node]] | None, default None): Precomputed semantic-key collision groups. When omitted, collisions are loaded from the database.
  • limit (int, default 10): Maximum number of candidate hits returned per source entity.
  • threshold (float | None, default None): Optional strategy-specific minimum score.

Returns:

  • list[SemanticKeyCollisionCandidates]: Collision groups with per-entity candidate matches.

Notes

Similarity matching is restricted to members of each collision group so that the output is safe to use as a merge-candidate inspection aid.
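
One way to picture the restriction: for each member of a collision group, the candidate pool handed to the matcher contains only the other members of that same group. The sketch below builds those pools using plain string identifiers. The function name and data are hypothetical, and the real method returns SemanticKeyCollisionCandidates objects rather than tuples.

```python
def collision_search_pools(collisions):
    """Return (semantic_key, source, candidates) triples, one per group member."""
    pools = []
    for semantic_key, members in collisions.items():
        for source in members:
            # Candidates are the other members of the same group only.
            candidates = [member for member in members if member != source]
            pools.append((semantic_key, source, candidates))
    return pools


collisions = {
    "org:acme": ["node-1", "node-2", "node-3"],
    "person:ada": ["node-7", "node-9"],
}
pools = collision_search_pools(collisions)
```

No pool ever crosses a group boundary, so a high-scoring entity with a different semantic key cannot appear in this report.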

find_similarity_candidates async

find_similarity_candidates(*, limit=10, threshold=None, entities=None, skip_semantic_key_collisions=False, semantic_key_collisions=None)

Run a broader matcher-based duplicate scan across entities.

Parameters:

  • limit (int, default 10): Maximum number of candidate hits returned per source entity.
  • threshold (float | None, default None): Optional matcher-specific minimum score.
  • entities (list[Node] | None, default None): Optional explicit entity pool. When omitted, persisted entities are loaded from the database with embeddings included.
  • skip_semantic_key_collisions (bool, default False): Whether entities already participating in an exact semantic-key collision should be excluded from this broader similarity scan.
  • semantic_key_collisions (dict[str, list[Node]] | None, default None): Precomputed semantic-key collision groups used when skip_semantic_key_collisions is enabled. When omitted, the groups are loaded from the database.

Returns:

  • list[EntitySimilarityResult]: Ranked similarity candidates grouped by source entity.

Notes

Each entity pair is considered at most once by searching only against entities that appear later in the candidate order.
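
The "later entities only" rule can be sketched as a loop in which each source is compared against the slice that follows it, so n entities cost n(n-1)/2 comparisons and no pair is reported from both sides. The scoring function here is a toy, and dropping sources without hits is an assumption of this sketch rather than documented behavior.

```python
def pairwise_once(entities, score, threshold):
    """Compare each entity only against entities that come later in the list."""
    results = []
    for index, source in enumerate(entities):
        later = entities[index + 1:]
        hits = [(candidate, score(source, candidate)) for candidate in later]
        hits = [hit for hit in hits if hit[1] >= threshold]
        hits.sort(key=lambda hit: hit[1], reverse=True)
        if hits:  # assumption: sources without hits are omitted
            results.append((source, hits))
    return results


calls = []


def shared_prefix_score(left, right):
    """Toy scorer: 1.0 when the first three letters match, ignoring case."""
    calls.append((left, right))
    return 1.0 if left[:3].lower() == right[:3].lower() else 0.0


names = ["Acme Inc", "ACME Incorporated", "Globex", "acme"]
results = pairwise_once(names, shared_prefix_score, threshold=0.5)
```

A consequence of this ordering is that a duplicate pair shows up under whichever entity comes first in the candidate order, not under both.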

find_duplicate_candidates async

find_duplicate_candidates(*, limit=10, threshold=None, skip_semantic_key_collisions_in_similarity_scan=True)

Run the two-step duplicate-finding heuristic across entities.

Parameters:

  • limit (int, default 10): Maximum number of candidate hits returned per source entity.
  • threshold (float | None, default None): Optional matcher-specific minimum score.
  • skip_semantic_key_collisions_in_similarity_scan (bool, default True): Whether the broader similarity scan should exclude entities already involved in exact semantic-key collisions.

Returns:

  • EntityDuplicateCandidates: Combined report containing exact collisions, verified collision-group candidates, and broader matcher-based similarity candidates.
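
With the default skip_semantic_key_collisions_in_similarity_scan=True, entities already reported in the exact-collision stage are left out of the broader scan, so the same pair is not surfaced twice. A minimal sketch of that filtering step follows, using dict records as hypothetical stand-ins for entity nodes.

```python
def similarity_scan_pool(entities, collisions, skip_collisions=True):
    """Return the entities that the broader similarity scan should consider."""
    if not skip_collisions:
        return list(entities)
    colliding_ids = {
        member["id"] for members in collisions.values() for member in members
    }
    return [entity for entity in entities if entity["id"] not in colliding_ids]


entities = [
    {"id": "1", "semantic_key": "org:acme"},
    {"id": "2", "semantic_key": "org:acme"},
    {"id": "3", "semantic_key": "org:acme-inc"},
    {"id": "4", "semantic_key": "org:globex"},
]
collisions = {"org:acme": entities[:2]}
scan_pool = similarity_scan_pool(entities, collisions)
full_pool = similarity_scan_pool(entities, collisions, skip_collisions=False)
```

The trade-off is that a near-duplicate such as the "org:acme-inc" record above is only compared against non-colliding entities when skipping is enabled. Disabling the flag widens the scan at the cost of overlapping results.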

VectorEntitySimilarityMatcher

Match similar entities using cosine similarity over embeddings.

Parameters:

  • db (GraphDB, required): Graph database adapter used to enumerate persisted entities.
  • default_threshold (float, default 0.8): Minimum cosine similarity required when threshold is not provided to search.

search async

search(*, entity, limit=10, threshold=None, candidates=None)

Return vector similarity hits for one entity.

Parameters:

  • entity (Node, required): Source entity used as the similarity query.
  • limit (int, default 10): Maximum number of candidate hits to return.
  • threshold (float | None, default None): Minimum cosine similarity required to keep a candidate. Defaults to default_threshold.
  • candidates (Sequence[Node] | None, default None): Optional candidate pool. When omitted, all persisted entities with embeddings are loaded from the database.

Returns:

  • list[NodeHit]: Ranked candidate hits.

Notes

Candidate scores use cosine similarity and therefore fall in the range [-1, 1], up to floating-point rounding.
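
For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their norms. The pure-Python sketch below shows how a few vector pairs compare against the default threshold of 0.8. Returning 0.0 for a zero-length vector is an assumption of this sketch, not documented behavior.

```python
import math


def cosine_similarity(left, right):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(left, right))
    norm = math.sqrt(sum(a * a for a in left)) * math.sqrt(sum(b * b for b in right))
    if norm == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / norm


close = cosine_similarity([3.0, 4.0], [4.0, 3.0])      # small angle
diagonal = cosine_similarity([1.0, 1.0], [1.0, 0.0])   # 45 degrees apart
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])  # pointing in opposite directions

passes_default = [score >= 0.8 for score in (close, diagonal, opposite)]
```

Only the first pair clears 0.8 here. Unlike the RapidFuzz scores, these values are not on a 0 to 100 scale, so thresholds are not interchangeable between the two matchers.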