Similarity Finder
EntitySimilarityFinder orchestrates both exact semantic-key collision checks and broader matcher-based similarity scans. It is the best low-level entry point when you need duplicate inspection without going through the full GraphRAG facade.
grawiki.similarity.similarity_finder
High-level entity similarity orchestration.
EntitySimilarityFinder
Inspect entity collisions and search for merge candidates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `db` | `GraphDB` | Graph database adapter used to load persisted entity nodes. | required |
| `matcher` | `EntitySimilarityMatcher \| None` | Similarity matcher implementation used to produce candidate matches. Defaults to :class: | `None` |
find_semantic_key_collisions
async
find_semantic_key_collisions(*, include_embeddings=False)
Return entity groups whose semantic key occurs more than once.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `include_embeddings` | `bool` | Whether the returned entities should include embeddings. | `False` |
Returns:
| Type | Description |
|---|---|
| `dict[str, list[Node]]` | Entity groups keyed by semantic key, including only keys with more than one entity. |
Notes
This method is intended as a lightweight integrity check before running more expensive similarity matching.
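The grouping rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `FakeNode` and its `semantic_key` attribute are hypothetical stand-ins for the real `Node` type, and the keys are made-up sample data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeNode:
    """Hypothetical stand-in for an entity node; not the library's Node."""

    node_id: str
    semantic_key: str


def group_collisions(nodes):
    """Group nodes by semantic key and keep only keys shared by 2+ nodes."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.semantic_key].append(node)
    # Keys that occur exactly once are not collisions and are dropped.
    return {key: members for key, members in groups.items() if len(members) > 1}


nodes = [
    FakeNode("n1", "person:ada-lovelace"),
    FakeNode("n2", "person:ada-lovelace"),
    FakeNode("n3", "person:alan-turing"),
]
collisions = group_collisions(nodes)
```

Because the check only compares keys, it needs no embeddings and no matcher, which is consistent with its role as a cheap first pass.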
search
async
search(entity, *, limit=10, threshold=None, candidates=None)
Return similarity candidates for one entity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `entity` | `Node` | Source entity used as the similarity query. | required |
| `limit` | `int` | Maximum number of candidate hits to return. | `10` |
| `threshold` | `float \| None` | Optional strategy-specific minimum score. | `None` |
| `candidates` | `list[Node] \| None` | Optional pre-filtered candidate pool. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[NodeHit]` | Ranked candidate hits. |
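How `limit`, `threshold`, and `candidates` interact depends on the configured matcher. The sketch below shows one plausible reading, assuming that higher scores mean closer matches and that `threshold` is an inclusive lower bound. The `rank_candidates` helper and the `difflib` scorer are illustrative choices, not part of the library.

```python
from difflib import SequenceMatcher


def rank_candidates(query, candidates, *, limit=10, threshold=None):
    """Score each candidate name against the query and return the top hits.

    Assumptions (not taken from the library): a higher score means a closer
    match, and threshold is an inclusive lower bound on the score.
    """
    scored = []
    for name in candidates:
        if name == query:
            continue  # never report the query itself as its own match
        score = SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if threshold is None or score >= threshold:
            scored.append((name, score))
    # Best match first, then truncate to the requested number of hits.
    scored.sort(key=lambda hit: hit[1], reverse=True)
    return scored[:limit]


pool = ["Ada Lovelace", "Ada King Lovelace", "Alan Turing", "A. Lovelace"]
hits = rank_candidates("Ada Lovelace", pool, limit=2, threshold=0.5)
```

Applying the threshold before the limit means a strict threshold can return fewer than `limit` hits, which is usually what a merge review wants.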
find_collision_candidates
async
find_collision_candidates(*, collisions=None, limit=10, threshold=None)
Run similarity search inside semantic-key collision groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `collisions` | `dict[str, list[Node]] \| None` | Precomputed semantic-key collision groups. When omitted, collisions are loaded from the database. | `None` |
| `limit` | `int` | Maximum number of candidate hits returned per source entity. | `10` |
| `threshold` | `float \| None` | Optional strategy-specific minimum score. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[SemanticKeyCollisionCandidates]` | Collision groups with per-entity candidate matches. |
Notes
Similarity matching is restricted to members of each collision group so that the output is safe to use as a merge-candidate inspection aid.
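A caller that has already loaded the collision groups can pass them in through `collisions` to avoid a second database read. The sketch below uses only the two method signatures documented on this page; `review_collisions` and `StubFinder` are hypothetical names, and the stub returns plain strings instead of `Node` objects so the example runs without a database. Whether the precomputed groups need `include_embeddings=True` depends on the matcher and is not specified here.

```python
import asyncio


async def review_collisions(finder, *, limit=5, threshold=None):
    """Load collisions once, then reuse them for the in-group similarity pass."""
    collisions = await finder.find_semantic_key_collisions()
    if not collisions:
        return collisions, []
    candidates = await finder.find_collision_candidates(
        collisions=collisions, limit=limit, threshold=threshold
    )
    return collisions, candidates


class StubFinder:
    """Hypothetical stand-in for the finder; returns canned sample data."""

    def __init__(self):
        self.received = None

    async def find_semantic_key_collisions(self, *, include_embeddings=False):
        return {"person:ada-lovelace": ["n1", "n2"]}

    async def find_collision_candidates(
        self, *, collisions=None, limit=10, threshold=None
    ):
        self.received = collisions  # record what the caller passed in
        return [(key, members) for key, members in collisions.items()]


stub = StubFinder()
collisions, candidates = asyncio.run(review_collisions(stub, limit=3))
```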
find_similarity_candidates
async
find_similarity_candidates(*, limit=10, threshold=None, entities=None, skip_semantic_key_collisions=False, semantic_key_collisions=None)
Run a broader matcher-based duplicate scan across entities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `limit` | `int` | Maximum number of candidate hits returned per source entity. | `10` |
| `threshold` | `float \| None` | Optional matcher-specific minimum score. | `None` |
| `entities` | `list[Node] \| None` | Optional explicit entity pool. When omitted, persisted entities are loaded from the database with embeddings included. | `None` |
| `skip_semantic_key_collisions` | `bool` | Whether entities already participating in an exact semantic-key collision should be excluded from this broader similarity scan. | `False` |
| `semantic_key_collisions` | `dict[str, list[Node]] \| None` | Precomputed semantic-key collision groups used when | `None` |
Returns:
| Type | Description |
|---|---|
| `list[EntitySimilarityResult]` | Ranked similarity candidates grouped by source entity. |
Notes
Each entity pair is considered at most once: every source entity is searched only against the entities that appear after it in the candidate order.
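The pair-once rule in the note can be sketched as a loop over suffixes of the entity list. This is a minimal illustration, not the library's code: `scan_pairs_once`, the token-based `jaccard` scorer, and the choice to omit sources with no hits are all assumptions made for the example.

```python
def jaccard(left, right):
    """Toy similarity: shared tokens divided by all tokens."""
    a, b = set(left.split()), set(right.split())
    return len(a & b) / len(a | b)


def scan_pairs_once(entities, score, *, limit=10, threshold=None):
    """Compare each entity only with the entities that come after it."""
    results = []
    for index, source in enumerate(entities):
        later = entities[index + 1:]  # earlier entities already covered this pair
        hits = [(other, score(source, other)) for other in later]
        if threshold is not None:
            hits = [hit for hit in hits if hit[1] >= threshold]
        hits.sort(key=lambda hit: hit[1], reverse=True)
        if hits:  # assumption: sources without hits are left out of the result
            results.append((source, hits[:limit]))
    return results


compared = []


def recording_score(left, right):
    """Wrap the scorer so the example can show which pairs were compared."""
    compared.append((left, right))
    return jaccard(left, right)


names = ["acme corp", "acme corporation", "globex", "acme corp ltd"]
results = scan_pairs_once(names, recording_score, threshold=0.3)
```

With four entities the loop scores six pairs rather than twelve, and a match between two entities is reported under whichever of them comes first in the order.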
find_duplicate_candidates
async
find_duplicate_candidates(*, limit=10, threshold=None, skip_semantic_key_collisions_in_similarity_scan=True)
Run the two-step duplicate-finding heuristic across entities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `limit` | `int` | Maximum number of candidate hits returned per source entity. | `10` |
| `threshold` | `float \| None` | Optional matcher-specific minimum score. | `None` |
| `skip_semantic_key_collisions_in_similarity_scan` | `bool` | Whether the broader similarity scan should exclude entities already involved in exact semantic-key collisions. | `True` |
Returns:
| Type | Description |
|---|---|
| `EntityDuplicateCandidates` | Combined report containing exact collisions, verified collision-group candidates, and broader matcher-based similarity candidates. |
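The effect of `skip_semantic_key_collisions_in_similarity_scan` can be pictured as a split of the entity pool before the second step runs. The sketch below is a simplified model under stated assumptions: `split_for_two_step_scan` is a hypothetical helper, and entities are represented as `(entity_id, semantic_key)` tuples rather than `Node` objects.

```python
from collections import defaultdict


def split_for_two_step_scan(entities, *, skip_collision_members=True):
    """Return (exact collisions, pool left for the broader similarity scan).

    entities is a list of (entity_id, semantic_key) tuples; this shape is an
    assumption made for the example.
    """
    by_key = defaultdict(list)
    for entity_id, key in entities:
        by_key[key].append(entity_id)

    # Step 1: exact semantic-key collisions.
    collisions = {key: ids for key, ids in by_key.items() if len(ids) > 1}
    colliding = {entity_id for ids in collisions.values() for entity_id in ids}

    # Step 2 input: optionally drop entities that step 1 already flagged.
    scan_pool = [
        entity_id
        for entity_id, _ in entities
        if not (skip_collision_members and entity_id in colliding)
    ]
    return collisions, scan_pool


sample = [("n1", "k1"), ("n2", "k1"), ("n3", "k2"), ("n4", "k3")]
collisions, scan_pool = split_for_two_step_scan(sample)
_, full_pool = split_for_two_step_scan(sample, skip_collision_members=False)
```

Skipping the flagged entities keeps the broader scan from reporting the same pair twice, at the cost of missing near-duplicates between a colliding entity and one outside its group.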