Similarity Finder

EntitySimilarityFinder orchestrates both exact semantic-key collision checks and broader matcher-based similarity scans. It is the best low-level entry point when you need duplicate inspection without going through the full GraphRAG facade.

grawiki.similarity.similarity_finder

High-level entity similarity orchestration.

EntitySimilarityFinder

Inspect entity collisions and search for merge candidates.

Parameters:

Name Type Description Default
db GraphDB

Graph database adapter used to load persisted entity nodes.

required
matcher EntitySimilarityMatcher | None

Similarity matcher implementation used to produce candidate matches. Defaults to grawiki.similarity.vector.VectorEntitySimilarityMatcher.

None
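
The optional matcher parameter follows a common dependency-injection pattern: pass an implementation, or fall back to a default when None is given. The sketch below illustrates that pattern only. Matcher, PrefixMatcher, ExactMatcher, and Finder are hypothetical stand-ins operating on plain strings, not grawiki classes, and the match method name is an assumption.

```python
from __future__ import annotations

import asyncio
from typing import Protocol


class Matcher(Protocol):
    """Hypothetical matcher interface (not the grawiki one)."""

    async def match(self, query: str, candidates: list[str], limit: int) -> list[str]: ...


class PrefixMatcher:
    """Hypothetical default: keep candidates sharing the query's first 3 characters."""

    async def match(self, query: str, candidates: list[str], limit: int) -> list[str]:
        prefix = query[:3].lower()
        return [c for c in candidates if c[:3].lower() == prefix][:limit]


class ExactMatcher:
    """Hypothetical alternative: case-insensitive exact match."""

    async def match(self, query: str, candidates: list[str], limit: int) -> list[str]:
        return [c for c in candidates if c.lower() == query.lower()][:limit]


class Finder:
    """Stand-in finder: a list of names plays the role of the database."""

    def __init__(self, db: list[str], matcher: Matcher | None = None) -> None:
        self.db = db
        # Fall back to the default matcher only when none was supplied.
        self.matcher = matcher if matcher is not None else PrefixMatcher()

    async def search(self, query: str, *, limit: int = 10) -> list[str]:
        return await self.matcher.match(query, self.db, limit)


names = ["Acme Corp", "ACME Inc", "Globex"]
default_hits = asyncio.run(Finder(names).search("Acme"))
exact_hits = asyncio.run(Finder(names, matcher=ExactMatcher()).search("acme corp"))
```

Swapping the matcher changes the matching strategy without touching the finder itself, which is the design choice the matcher parameter suggests.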

find_semantic_key_collisions async

find_semantic_key_collisions(*, include_embeddings=False)

Return entity groups whose semantic key occurs more than once.

Parameters:

Name Type Description Default
include_embeddings bool

Whether the returned entities should include embeddings.

False

Returns:

Type Description
dict[str, list[Node]]

Entity groups keyed by semantic key, including only keys with more than one entity.

Notes

This method is intended as a lightweight integrity check before running more expensive similarity matching.
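
The grouping described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: Entity is a hypothetical stand-in for a persisted node, and the semantic_key attribute name is assumed.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for a persisted entity node."""

    id: str
    semantic_key: str  # assumed attribute name


def group_semantic_key_collisions(entities: list[Entity]) -> dict[str, list[Entity]]:
    """Group entities by semantic key, keeping only keys shared by 2+ entities."""
    groups: dict[str, list[Entity]] = defaultdict(list)
    for entity in entities:
        groups[entity.semantic_key].append(entity)
    return {key: members for key, members in groups.items() if len(members) > 1}


entities = [
    Entity("e1", "person:ada lovelace"),
    Entity("e2", "person:ada lovelace"),
    Entity("e3", "person:alan turing"),
]
collisions = group_semantic_key_collisions(entities)
```

Because the check is a single pass with exact key comparison and needs no embeddings, it is cheap relative to matcher-based scoring, which fits its role as a preliminary integrity check.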

search async

search(entity, *, limit=10, threshold=None, candidates=None)

Return similarity candidates for one entity.

Parameters:

Name Type Description Default
entity Node

Source entity used as the similarity query.

required
limit int

Maximum number of candidate hits to return.

10
threshold float | None

Optional strategy-specific minimum score.

None
candidates list[Node] | None

Optional pre-filtered candidate pool.

None

Returns:

Type Description
list[NodeHit]

Ranked candidate hits.
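
To make the roles of limit, threshold, and candidates concrete, here is a small sketch of one plausible ranking strategy. It assumes cosine similarity over embeddings, treats threshold as an inclusive lower bound, and excludes the source entity from its own results; none of these details are confirmed by this page. Entity and Hit are hypothetical stand-ins for Node and NodeHit.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for an entity node with an embedding."""

    id: str
    embedding: tuple[float, ...]


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a scored candidate hit."""

    entity: Entity
    score: float


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(
    entity: Entity,
    candidates: list[Entity],
    *,
    limit: int = 10,
    threshold: float | None = None,
) -> list[Hit]:
    """Score candidates against the source entity and return the top hits."""
    hits = [
        Hit(c, cosine(entity.embedding, c.embedding))
        for c in candidates
        if c.id != entity.id  # assumption: never match an entity with itself
    ]
    if threshold is not None:
        hits = [h for h in hits if h.score >= threshold]
    hits.sort(key=lambda h: h.score, reverse=True)
    return hits[:limit]


query = Entity("q", (1.0, 0.0))
pool = [Entity("a", (1.0, 0.0)), Entity("b", (1.0, 1.0)), Entity("c", (0.0, 1.0))]
hits = rank_candidates(query, pool, threshold=0.5)
top_hit = rank_candidates(query, pool, limit=1)
```

Here the threshold filters before the limit truncates, so a low limit never hides the fact that weaker hits were already dropped by the score cutoff.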

find_collision_candidates async

find_collision_candidates(*, collisions=None, limit=10, threshold=None)

Run similarity search inside semantic-key collision groups.

Parameters:

Name Type Description Default
collisions dict[str, list[Node]] | None

Precomputed semantic-key collision groups. When omitted, collisions are loaded from the database.

None
limit int

Maximum number of candidate hits returned per source entity.

10
threshold float | None

Optional strategy-specific minimum score.

None

Returns:

Type Description
list[SemanticKeyCollisionCandidates]

Collision groups with per-entity candidate matches.

Notes

Similarity matching is restricted to members of each collision group so that the output is safe to use as a merge-candidate inspection aid.
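
The restriction to group members can be pictured as follows. This sketch is illustrative only: Entity, search, and collision_candidates are hypothetical, and a token-overlap (Jaccard) score stands in for whatever the configured matcher computes.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for an entity node."""

    id: str
    name: str


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


async def search(entity: Entity, *, candidates: list[Entity], limit: int = 10):
    """Hypothetical matcher call: rank an explicit candidate pool."""
    scored = [(c.id, token_overlap(entity.name, c.name)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


async def collision_candidates(collisions: dict[str, list[Entity]], *, limit: int = 10):
    """For every member of every group, search only the other group members."""
    report = {}
    for key, members in collisions.items():
        for entity in members:
            pool = [m for m in members if m.id != entity.id]
            report[(key, entity.id)] = await search(entity, candidates=pool, limit=limit)
    return report


collisions = {
    "org:acme": [Entity("e1", "Acme Corp"), Entity("e2", "Acme Corporation")],
    "person:j smith": [Entity("e3", "John Smith"), Entity("e4", "J Smith")],
}
report = asyncio.run(collision_candidates(collisions))
```

Since each pool is built from one group, a hit can never cross from one semantic key to another, which is what keeps the result usable as a per-group inspection aid.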

find_similarity_candidates async

find_similarity_candidates(*, limit=10, threshold=None, entities=None, skip_semantic_key_collisions=False, semantic_key_collisions=None)

Run a broader matcher-based duplicate scan across entities.

Parameters:

Name Type Description Default
limit int

Maximum number of candidate hits returned per source entity.

10
threshold float | None

Optional matcher-specific minimum score.

None
entities list[Node] | None

Optional explicit entity pool. When omitted, persisted entities are loaded from the database with embeddings included.

None
skip_semantic_key_collisions bool

Whether entities already participating in an exact semantic-key collision should be excluded from this broader similarity scan.

False
semantic_key_collisions dict[str, list[Node]] | None

Precomputed semantic-key collision groups used when skip_semantic_key_collisions is enabled. When omitted, the groups are loaded from the database.

None

Returns:

Type Description
list[EntitySimilarityResult]

Ranked similarity candidates grouped by source entity.

Notes

Each entity pair is considered at most once: every source entity is searched only against entities that appear later in the candidate order.
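
The pair-once rule in the note above amounts to comparing each item only with the items after it. The following sketch shows the idea on plain strings, with difflib's SequenceMatcher ratio as an assumed stand-in for the real matcher; scan_each_pair_once and name_score are hypothetical helpers.

```python
from __future__ import annotations

from difflib import SequenceMatcher

# Records every comparison so the pair-once property can be checked.
compared: list[tuple[str, str]] = []


def name_score(a: str, b: str) -> float:
    """Stand-in similarity score: case-insensitive string ratio."""
    compared.append((a, b))
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def scan_each_pair_once(
    names: list[str], *, threshold: float | None = None, limit: int = 10
) -> dict[str, list[tuple[str, float]]]:
    """Compare each name only with the names that follow it in the list."""
    results: dict[str, list[tuple[str, float]]] = {}
    for index, source in enumerate(names):
        hits = []
        for candidate in names[index + 1:]:
            score = name_score(source, candidate)
            if threshold is None or score >= threshold:
                hits.append((candidate, score))
        hits.sort(key=lambda hit: hit[1], reverse=True)
        if hits:
            results[source] = hits[:limit]
    return results


names = ["Acme Corp", "ACME Corp.", "Globex", "Initech"]
results = scan_each_pair_once(names, threshold=0.8)
```

For n entities this performs n(n-1)/2 comparisons instead of n(n-1). A consequence is that a matching pair is reported only under whichever entity comes first in the order, so "ACME Corp." does not list "Acme Corp" again.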

find_duplicate_candidates async

find_duplicate_candidates(*, limit=10, threshold=None, skip_semantic_key_collisions_in_similarity_scan=True)

Run the two-step duplicate-finding heuristic across entities.

Parameters:

Name Type Description Default
limit int

Maximum number of candidate hits returned per source entity.

10
threshold float | None

Optional matcher-specific minimum score.

None
skip_semantic_key_collisions_in_similarity_scan bool

Whether the broader similarity scan should exclude entities already involved in exact semantic-key collisions.

True

Returns:

Type Description
EntityDuplicateCandidates

Combined report containing exact collisions, verified collision-group candidates, and broader matcher-based similarity candidates.
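
The two-step heuristic can be sketched end to end as follows: first collect exact semantic-key collisions and score pairs inside each group, then run a broader scan that optionally leaves out entities already caught in step one. Everything here is an illustrative assumption: Entity, DuplicateReport, and find_duplicates are hypothetical, the report fields do not mirror EntityDuplicateCandidates, and a string ratio replaces the real matcher.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for an entity node."""

    id: str
    name: str
    semantic_key: str


@dataclass
class DuplicateReport:
    """Hypothetical combined report (field names are assumptions)."""

    collisions: dict[str, list[Entity]]
    collision_candidates: list[tuple[str, str, float]]
    similarity_candidates: list[tuple[str, str, float]]


def score(a: Entity, b: Entity) -> float:
    return SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()


def scored_pairs(pool: list[Entity], threshold: float) -> list[tuple[str, str, float]]:
    """Score each pair once (source against later entities only)."""
    pairs = []
    for index, source in enumerate(pool):
        for candidate in pool[index + 1:]:
            value = score(source, candidate)
            if value >= threshold:
                pairs.append((source.id, candidate.id, value))
    return pairs


def find_duplicates(
    entities: list[Entity], *, threshold: float = 0.8, skip_collisions_in_scan: bool = True
) -> DuplicateReport:
    # Step 1: exact semantic-key collisions, then matching inside each group.
    groups: dict[str, list[Entity]] = defaultdict(list)
    for entity in entities:
        groups[entity.semantic_key].append(entity)
    collisions = {k: v for k, v in groups.items() if len(v) > 1}
    in_group = [p for members in collisions.values() for p in scored_pairs(members, threshold)]

    # Step 2: broader scan, optionally excluding entities found in step 1.
    colliding = {e.id for members in collisions.values() for e in members}
    pool = [e for e in entities if not (skip_collisions_in_scan and e.id in colliding)]
    return DuplicateReport(collisions, in_group, scored_pairs(pool, threshold))


entities = [
    Entity("e1", "Ada Lovelace", "person:ada lovelace"),
    Entity("e2", "Ada Lovelace", "person:ada lovelace"),
    Entity("e3", "Alan Turing", "person:alan turing"),
    Entity("e4", "Alan M. Turing", "person:alan m turing"),
]
report = find_duplicates(entities)
full_scan = find_duplicates(entities, skip_collisions_in_scan=False)
```

With skipping enabled, the exact-key duplicates appear only in the collision sections and the broader scan surfaces just the near-duplicate names, which keeps the two parts of the report from overlapping.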