Matchers

Matchers provide the scoring strategy behind duplicate-candidate discovery.

grawiki.similarity.base

Protocols for entity similarity matching.

EntitySimilarityMatcher

Bases: Protocol

Protocol for entity-to-entity similarity matcher implementations.

search async

search(*, entity, limit=10, threshold=None, candidates=None)

Return ranked entity candidates for entity.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| entity | Node | Source entity used as the similarity query. | required |
| limit | int | Maximum number of candidate hits to return. | 10 |
| threshold | float \| None | Optional matcher-specific minimum score. | None |
| candidates | Sequence[Node] \| None | Optional pre-filtered candidate pool. | None |

Returns:

| Type | Description |
| --- | --- |
| list[NodeHit] | Ranked candidate hits for the source entity. |
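
As a rough illustration of what conforming to this protocol might look like, the sketch below restates the documented keyword-only async signature and implements it with a toy matcher. `Node` and `NodeHit` here are hypothetical stand-ins with made-up fields (`name`, `node`, `score`), not grawiki's real classes, and `ExactNameMatcher`, `find_duplicates`, and the decision to skip the source entity are assumptions made only for this example.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for grawiki's Node; only a name is modelled."""
    name: str


@dataclass(frozen=True)
class NodeHit:
    """Hypothetical stand-in for grawiki's NodeHit: a candidate and its score."""
    node: Node
    score: float


class EntitySimilarityMatcher(Protocol):
    """Restates the documented signature: keyword-only arguments, async."""

    async def search(
        self,
        *,
        entity: Node,
        limit: int = 10,
        threshold: float | None = None,
        candidates: Sequence[Node] | None = None,
    ) -> list[NodeHit]: ...


class ExactNameMatcher:
    """Toy matcher: 1.0 for a case-insensitive name match, otherwise 0.0."""

    async def search(self, *, entity, limit=10, threshold=None, candidates=None):
        pool = candidates or []  # this sketch has no database to fall back on
        cutoff = 1.0 if threshold is None else threshold
        hits = [
            NodeHit(node=c, score=1.0 if c.name.lower() == entity.name.lower() else 0.0)
            for c in pool
            if c is not entity  # assumption: never report the source entity itself
        ]
        kept = [h for h in hits if h.score >= cutoff]
        kept.sort(key=lambda h: h.score, reverse=True)
        return kept[:limit]


async def find_duplicates(
    matcher: EntitySimilarityMatcher, entity: Node, pool: Sequence[Node]
) -> list[NodeHit]:
    # Callers depend only on the protocol, so any matcher can be swapped in.
    return await matcher.search(entity=entity, candidates=pool)


pool = [Node("Ada Lovelace"), Node("ada lovelace"), Node("Alan Turing")]
hits = asyncio.run(find_duplicates(ExactNameMatcher(), pool[0], pool))
print([h.node.name for h in hits])
```

Because the protocol is structural, `ExactNameMatcher` does not need to inherit from `EntitySimilarityMatcher`; having a compatible `search` method is enough for a type checker to accept it.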

grawiki.similarity.vector

Embedding-based entity similarity matching.

VectorEntitySimilarityMatcher

Match similar entities using cosine similarity over embeddings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| db | GraphDB | Graph database adapter used to enumerate persisted entities. | required |
| default_threshold | float | Minimum cosine similarity required when threshold is not provided to search. | 0.8 |

search async

search(*, entity, limit=10, threshold=None, candidates=None)

Return vector similarity hits for one entity.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| entity | Node | Source entity used as the similarity query. | required |
| limit | int | Maximum number of candidate hits to return. | 10 |
| threshold | float \| None | Minimum cosine similarity required to keep a candidate. Defaults to default_threshold. | None |
| candidates | Sequence[Node] \| None | Optional candidate pool. When omitted, all persisted entities with embeddings are loaded from the database. | None |

Returns:

| Type | Description |
| --- | --- |
| list[NodeHit] | Ranked candidate hits. |

Notes

Candidate scores are cosine similarities and therefore fall in the range [-1, 1], up to floating-point rounding.
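
To make the score range and the threshold fallback concrete, here is a minimal pure-Python sketch of cosine-based ranking. It is not the library's implementation: `cosine_similarity`, `rank_by_cosine`, the plain lists used as embeddings, and the choice to score zero-length vectors as 0.0 are all assumptions for illustration. Only the `limit`, `threshold`, and 0.8 default come from the parameters documented above.

```python
from __future__ import annotations

import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_a * norm_b)


def rank_by_cosine(
    query: Sequence[float],
    candidates: dict[str, Sequence[float]],
    *,
    limit: int = 10,
    threshold: float | None = None,
    default_threshold: float = 0.8,
) -> list[tuple[str, float]]:
    """Score every candidate, drop those below the cutoff, return the best first."""
    cutoff = default_threshold if threshold is None else threshold
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    kept = [pair for pair in scored if pair[1] >= cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Hypothetical two-dimensional embeddings keyed by entity name.
embeddings = {
    "same_direction": [2.0, 0.0],
    "close": [1.0, 0.5],
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}
ranked = rank_by_cosine([1.0, 0.0], embeddings)
print(ranked)
```

With the default cutoff of 0.8, only the two vectors pointing roughly the same way as the query survive. Passing `threshold=-1.0` keeps every candidate, including the one scored -1 for pointing in the opposite direction.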

grawiki.similarity.fuzzy

RapidFuzz-based entity similarity matching.

RapidFuzzEntitySimilarityMatcher

Match similar entity names using RapidFuzz.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| db | GraphDB | Graph database adapter used to enumerate persisted entities. | required |
| default_threshold | float | Minimum similarity score required when threshold is not provided to search. | 90.0 |

search async

search(*, entity, limit=10, threshold=None, candidates=None)

Return RapidFuzz similarity hits for one entity.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| entity | Node | Source entity used as the similarity query. | required |
| limit | int | Maximum number of candidate hits to return. | 10 |
| threshold | float \| None | Minimum score required to keep a candidate. Defaults to default_threshold. | None |
| candidates | Sequence[Node] \| None | Optional candidate pool. When omitted, all persisted entities are loaded from the database. | None |

Returns:

| Type | Description |
| --- | --- |
| list[NodeHit] | Ranked candidate hits. |

Notes

Candidate scores use rapidfuzz.fuzz.WRatio and therefore fall in the range [0, 100].
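
The 0 to 100 scale is easiest to see by scoring a few names directly. The sketch below calls `rapidfuzz.fuzz.WRatio` when RapidFuzz is installed; if it is not, it falls back to `difflib.SequenceMatcher` scaled to 0 to 100, which is a different algorithm and is included only so the example runs anywhere. `name_score`, `rank_names`, and the sample names are hypothetical; how the matcher itself extracts names from entities is not shown in this reference and is not modelled here.

```python
from __future__ import annotations

try:
    from rapidfuzz import fuzz

    def name_score(a: str, b: str) -> float:
        """WRatio score on RapidFuzz's 0 to 100 scale."""
        return float(fuzz.WRatio(a, b))

except ImportError:
    from difflib import SequenceMatcher

    def name_score(a: str, b: str) -> float:
        """Fallback stand-in (NOT WRatio): difflib ratio rescaled to 0 to 100."""
        return 100.0 * SequenceMatcher(None, a, b).ratio()


def rank_names(
    query: str,
    names: list[str],
    *,
    limit: int = 10,
    threshold: float | None = None,
    default_threshold: float = 90.0,
) -> list[tuple[str, float]]:
    """Score each name against the query and keep those at or above the cutoff."""
    cutoff = default_threshold if threshold is None else threshold
    scored = [(name, name_score(query, name)) for name in names]
    kept = [pair for pair in scored if pair[1] >= cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Hypothetical entity names.
names = ["Acme Corp", "Acme Corporation", "Zzz"]
for name, score in rank_names("Acme Corp", names, threshold=0.0):
    print(f"{score:6.1f}  {name}")
```

Since these scores are percentages rather than cosine values, a threshold tuned for the vector matcher (such as 0.8) would let nearly everything through here, which is why each matcher carries its own `default_threshold`.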