Matchers
Matchers provide the scoring strategy behind duplicate-candidate discovery.
- `EntitySimilarityMatcher` defines the protocol.
- `VectorEntitySimilarityMatcher` compares entity embeddings with cosine similarity.
- `RapidFuzzEntitySimilarityMatcher` compares entity names with string similarity.
grawiki.similarity.base
Protocols for entity similarity matching.
EntitySimilarityMatcher
Bases: Protocol
Protocol for entity-to-entity similarity matcher implementations.
search
async
search(*, entity, limit=10, threshold=None, candidates=None)
Return ranked entity candidates for `entity`.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `entity` | `Node` | Source entity used as the similarity query. | required |
| `limit` | `int` | Maximum number of candidate hits to return. | `10` |
| `threshold` | `float \| None` | Optional matcher-specific minimum score. | `None` |
| `candidates` | `Sequence[Node] \| None` | Optional pre-filtered candidate pool. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[NodeHit]` | Ranked candidate hits for the source entity. |
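The protocol above is structural, so any class with a compatible `search` coroutine should satisfy it. The following is a minimal sketch only: `Node` and `NodeHit` here are hypothetical stand-ins (their real fields are not documented on this page), and `ExactNameMatcher` and `find_duplicates` are illustrative names that are not part of the library.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for the library's Node type."""
    id: str
    name: str


@dataclass(frozen=True)
class NodeHit:
    """Hypothetical stand-in for the library's NodeHit type."""
    node: Node
    score: float


class EntitySimilarityMatcher(Protocol):
    """Structural type mirroring the documented search signature."""

    async def search(
        self,
        *,
        entity: Node,
        limit: int = 10,
        threshold: float | None = None,
        candidates: Sequence[Node] | None = None,
    ) -> list[NodeHit]: ...


class ExactNameMatcher:
    """Toy matcher: 1.0 for a case-insensitive name match, else 0.0."""

    async def search(self, *, entity, limit=10, threshold=None, candidates=None):
        pool = candidates or []
        hits = [
            NodeHit(node=c, score=1.0 if c.name.lower() == entity.name.lower() else 0.0)
            for c in pool
            if c.id != entity.id  # assumption: an entity is not its own duplicate
        ]
        if threshold is not None:
            hits = [h for h in hits if h.score >= threshold]
        hits.sort(key=lambda h: h.score, reverse=True)  # best score first
        return hits[:limit]


async def find_duplicates(
    matcher: EntitySimilarityMatcher, entity: Node, pool: Sequence[Node]
) -> list[NodeHit]:
    # Calling code depends only on the protocol, not on a concrete matcher.
    return await matcher.search(entity=entity, threshold=0.5, candidates=pool)
```

Because callers depend only on the `search` signature, a vector-based or string-based matcher can be swapped in without changing the calling code.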
grawiki.similarity.vector
Embedding-based entity similarity matching.
VectorEntitySimilarityMatcher
Match similar entities using cosine similarity over embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `db` | `GraphDB` | Graph database adapter used to enumerate persisted entities. | required |
| `default_threshold` | `float` | Minimum cosine similarity required when | `0.8` |
search
async
search(*, entity, limit=10, threshold=None, candidates=None)
Return vector similarity hits for one entity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `entity` | `Node` | Source entity used as the similarity query. | required |
| `limit` | `int` | Maximum number of candidate hits to return. | `10` |
| `threshold` | `float \| None` | Minimum cosine similarity required to keep a candidate. Defaults to :attr: | `None` |
| `candidates` | `Sequence[Node] \| None` | Optional candidate pool. When omitted, all persisted entities with embeddings are loaded from the database. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[NodeHit]` | Ranked candidate hits. |
Notes
Candidate scores use cosine similarity and therefore usually fall in
the range [-1, 1].
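To make the score range and threshold concrete, here is a minimal pure-Python sketch of cosine scoring followed by threshold filtering and ranking. It is not the matcher's actual implementation: `cosine_similarity` and `rank_by_cosine` are hypothetical helper names, the candidate pool is a plain dict rather than `Node` objects, and returning 0.0 for zero-length vectors is an assumption made here.

```python
from __future__ import annotations

import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_by_cosine(
    query: Sequence[float],
    candidates: dict[str, Sequence[float]],
    threshold: float = 0.8,
    limit: int = 10,
) -> list[tuple[str, float]]:
    """Score every candidate, drop those below threshold, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


embeddings = {
    "same": [2.0, 0.0],        # same direction as the query
    "close": [1.0, 0.5],       # small angle to the query
    "orthogonal": [0.0, 1.0],  # unrelated direction
    "opposite": [-1.0, 0.0],   # opposite direction
}
ranked = rank_by_cosine([1.0, 0.0], embeddings)
```

With the documented default of 0.8, only the `same` and `close` vectors survive the filter; orthogonal and opposite vectors score 0 and -1 respectively.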
grawiki.similarity.fuzzy
RapidFuzz-based entity similarity matching.
RapidFuzzEntitySimilarityMatcher
Match similar entity names using RapidFuzz.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `db` | `GraphDB` | Graph database adapter used to enumerate persisted entities. | required |
| `default_threshold` | `float` | Minimum similarity score required when | `90.0` |
search
async
search(*, entity, limit=10, threshold=None, candidates=None)
Return RapidFuzz similarity hits for one entity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `entity` | `Node` | Source entity used as the similarity query. | required |
| `limit` | `int` | Maximum number of candidate hits to return. | `10` |
| `threshold` | `float \| None` | Minimum score required to keep a candidate. Defaults to :attr: | `None` |
| `candidates` | `Sequence[Node] \| None` | Optional candidate pool. When omitted, all persisted entities are loaded from the database. | `None` |
Returns:
| Type | Description |
|---|---|
| `list[NodeHit]` | Ranked candidate hits. |
Notes
Candidate scores use `rapidfuzz.fuzz.WRatio` and therefore fall
in the range [0, 100].
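The 0 to 100 scale means thresholds here are not interchangeable with the cosine thresholds of the vector matcher. The sketch below illustrates threshold filtering on that scale with a pluggable scorer. To stay dependency-free it swaps in a standard-library `difflib` scorer, which is not `WRatio` and will produce different scores; `difflib_score` and `rank_by_name` are hypothetical names, not part of the library.

```python
from __future__ import annotations

from difflib import SequenceMatcher
from typing import Callable, Sequence


def difflib_score(a: str, b: str) -> float:
    """Stand-in scorer on a 0-100 scale (NOT rapidfuzz's WRatio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_by_name(
    query: str,
    names: Sequence[str],
    scorer: Callable[[str, str], float] = difflib_score,
    threshold: float = 90.0,
    limit: int = 10,
) -> list[tuple[str, float]]:
    """Score each name against the query, keep scores >= threshold, best first."""
    scored = [(name, scorer(query, name)) for name in names]
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


names = ["ACME Corporation", "Acme Corporatoin", "Globex"]
hits = rank_by_name("Acme Corporation", names)
```

If RapidFuzz is installed, passing `scorer=rapidfuzz.fuzz.WRatio` to the same helper should use the scorer named in the note above, since it also takes two strings and returns a score between 0 and 100.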