
API Overview

GraWiki exposes a layered API. Most users should start with GraphRAG, which provides ingestion, search, memory, and entity-deduplication workflows through one facade.

Use this section in the following order:

  1. GraphRAG for the high-level application surface.
  2. Retrieval pages for query-time search behavior.
  3. Graph model pages for persisted node and relationship shapes.
  4. Database abstractions when implementing or debugging a backend.
  5. Similarity and deduplication pages when inspecting duplicate entities or running merges.

At a high level, the API is split into facade-level entry points and lower-level implementation layers:

  • GraphRAG is the normal application surface.
  • Retrieval, graph, database, and similarity pages document the subsystems that GraphRAG composes.
  • The extraction and FalkorDB adapter pages are advanced reference material.
  • Helper modules with leading underscores are internal and intentionally undocumented here.

For task-oriented examples, use the How to section alongside this reference. The generated API sections are backed by docstrings from src/, so the reference stays aligned with the code that ships.

grawiki

Public GraWiki package surface.

The top-level package intentionally re-exports grawiki.GraphRAG as the main entry point for users who want document ingestion, retrieval, memory, and entity-deduplication workflows through one facade.

GraphRAG

Orchestrate document ingestion and retrieval-augmented search.

The stepwise ingestion helpers exposed for notebooks and debugging are read_document(...), chunk_document(...), process_chunks(...), embed_document(...), embed_chunks(...), build_document_node(...), build_chunk_nodes(...), persist_document_and_chunks(...), extract_kg_per_chunk(...), and persist_entities_and_relationships(...).
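The helpers above can be chained by hand when intermediate values need inspecting. The sketch below is one plausible ordering inferred from the signatures documented on this page, not the library's actual ingest implementation. ingest_stepwise is a hypothetical helper, rag is assumed to be an already configured GraphRAG instance, and passing the keys of the extracted graphs as owner_ids is an assumption.

```python
from pathlib import Path
from typing import Any


async def ingest_stepwise(rag: Any, path: Path) -> None:
    """Run the documented step methods one by one (hypothetical helper)."""
    # Synchronous steps: load the file and split it into chunks.
    document = rag.read_document(path)
    chunks = rag.chunk_document(document)

    # Optional chunk processors run before embedding and extraction.
    chunks = await rag.process_chunks(chunks)

    # embed_document is documented to return an empty list.
    document_embedding = await rag.embed_document(document)
    chunk_embeddings = await rag.embed_chunks(chunks)

    # Build node models, then persist them.
    document_node = rag.build_document_node(document, document_embedding)
    chunk_nodes = rag.build_chunk_nodes(chunks, chunk_embeddings)
    await rag.persist_document_and_chunks(document_node, chunk_nodes)

    # Extract one knowledge graph per chunk and persist the results.
    graphs = await rag.extract_kg_per_chunk(chunks, show_progress=True)
    # Assumption: the chunk identifiers that key the graphs are the owner ids.
    await rag.persist_entities_and_relationships(list(graphs), graphs)
```

In a notebook this would be awaited directly, for example await ingest_stepwise(rag, Path("notes.md")), where the path is a placeholder.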

Parameters:

Name Type Description Default
model str

Chat model used by the knowledge graph extractor.

required
embedding_model str

Embedding model used for documents, chunks, entities, and queries.

required
db GraphDB

Graph database adapter used for persistence and search.

required
chunking_strategy str

Chunking strategy passed to grawiki.doc_processing.chunkers.Chunker.

'sentence'
chunk_processors list[ChunkProcessor] | None

Optional chunk-level processing steps applied after chunking and before embedding and graph extraction. Useful for enrichment or normalization tasks such as question generation, entity anonymization, or metadata injection.

None
markdown_pipeline Pipeline | None

Optional markdown-aware pipeline used for markdown content. When omitted, markdown falls back to the generic text chunker.

None
max_workers int

Maximum number of concurrent chunk-level extraction coroutines.

4
embedding Embedding | None

Embedding override for tests or debugging.

None
kg_extractor KnowledgeGraphExtractorProtocol | None

Knowledge graph extractor override for tests or debugging.

None
kg_output_language str

Language used by the default knowledge graph extractor for entity names, relationship labels, and textual properties. Defaults to "English".

'English'
kg_extractor_kwargs dict[str, Any] | None

Extra keyword arguments forwarded to the default extractor's instructor.create(...) call when kg_extractor is omitted. Useful for provider-specific options such as reasoning_effort or max_retries.

None
similarity_finder EntitySimilarityFinder | None

Entity similarity finder used for collision inspection and candidate lookup. Defaults to a finder backed by the vector similarity matcher.

None
resolve_entities_on_ingest bool

When True, each freshly extracted entity is compared against persisted entities before persistence. If a persisted entity is found whose cosine similarity exceeds entity_resolution_threshold, the extracted node is replaced by the persisted node and all relationship endpoints are rewritten accordingly. Defaults to False.

False
entity_resolution_threshold float

Minimum cosine-similarity score for two entities to be considered the same during ingest-time resolution. Only used when resolve_entities_on_ingest=True. Defaults to 0.92.

0.92
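To make the ingest-time resolution rule concrete, here is a minimal, self-contained sketch of cosine-similarity thresholding. It is an illustration only, not GraWiki's resolver: cosine_similarity and resolve_entity are hypothetical names, embeddings are plain lists of floats, and the strict greater-than comparison follows the word "exceeds" in the description of resolve_entities_on_ingest.

```python
import math
from typing import Mapping, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def resolve_entity(
    extracted: Sequence[float],
    persisted: Mapping[str, Sequence[float]],
    threshold: float = 0.92,
) -> Optional[str]:
    """Return the id of the best persisted match above threshold, else None."""
    best_id: Optional[str] = None
    best_score = threshold
    for entity_id, vector in persisted.items():
        score = cosine_similarity(extracted, vector)
        # Strictly greater: a score equal to the threshold does not match.
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id
```

When a match is returned, the extracted node would be swapped for the persisted one and relationship endpoints rewritten, as described above. A higher threshold makes such replacements rarer.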

read_document

read_document(path)

Load one source document from disk.

Parameters:

Name Type Description Default
path Path

Filesystem path to the source document.

required

Returns:

Type Description
Document

Loaded source document.

chunk_document

chunk_document(document, format=None)

Split a document into chunks.

Parameters:

Name Type Description Default
document Document

Source document to segment.

required
format ('text', 'markdown')

Explicit content-format override. When omitted (None), the method uses document.metadata["content_format"] and falls back to "text".

None

Returns:

Type Description
list[Chunk]

Chunk sequence produced by the configured chunker. Markdown content uses the markdown pipeline only when one was configured explicitly.
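The format fallback described for chunk_document can be summarized in a few lines. This is a sketch of the documented rule only; resolve_content_format and uses_markdown_pipeline are hypothetical names, and metadata is modeled as a plain mapping.

```python
from typing import Mapping, Optional


def resolve_content_format(
    explicit: Optional[str], metadata: Mapping[str, str]
) -> str:
    """Explicit override first, then metadata, then the "text" fallback."""
    if explicit is not None:
        return explicit
    return metadata.get("content_format", "text")


def uses_markdown_pipeline(
    content_format: str, markdown_pipeline: Optional[object]
) -> bool:
    """Markdown takes the markdown pipeline only when one was configured."""
    return content_format == "markdown" and markdown_pipeline is not None
```

Under this rule, a markdown document ingested without a configured markdown_pipeline is chunked by the generic text chunker.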

process_chunks async

process_chunks(chunks)

Apply configured chunk processors in sequence.

Parameters:

Name Type Description Default
chunks list[Chunk]

Chunks to process.

required

Returns:

Type Description
list[Chunk]

Processed chunks in the same order as the input sequence.
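Sequential processing means each processor sees the output of the previous one, so their order matters. The sketch below shows the pattern with stand-in types. Chunk here is a hypothetical dataclass, and modeling a processor as an async callable over a list of chunks is an assumption; the real ChunkProcessor interface may differ.

```python
import asyncio
from dataclasses import dataclass, field, replace
from typing import Awaitable, Callable


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for the library's chunk type."""
    content: str
    metadata: dict = field(default_factory=dict)


# Assumed processor shape: async function from chunks to chunks.
Processor = Callable[[list[Chunk]], Awaitable[list[Chunk]]]


async def strip_whitespace(chunks: list[Chunk]) -> list[Chunk]:
    """Normalization step: trim surrounding whitespace."""
    return [replace(c, content=c.content.strip()) for c in chunks]


async def tag_length(chunks: list[Chunk]) -> list[Chunk]:
    """Metadata-injection step: record the content length."""
    return [
        replace(c, metadata={**c.metadata, "length": str(len(c.content))})
        for c in chunks
    ]


async def run_processors(
    chunks: list[Chunk], processors: list[Processor]
) -> list[Chunk]:
    """Apply processors one after another, preserving chunk order."""
    for processor in processors:
        chunks = await processor(chunks)
    return chunks
```

Running strip_whitespace before tag_length records the trimmed length; swapping the two would record the untrimmed length instead.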

embed_document async

embed_document(document)

Return an empty embedding. Documents are not embedded during ingestion.

Parameters:

Name Type Description Default
document Document

Source document. Kept in the signature for step-method API compatibility.

required

Returns:

Type Description
list[float]

Empty list. Document content is persisted without a vector; chunk, entity, memory, and query embeddings remain the retrieval path.

embed_chunks async

embed_chunks(chunks)

Embed chunk contents in one batch.

Parameters:

Name Type Description Default
chunks list[Chunk]

Chunks whose content should be embedded.

required

Returns:

Type Description
list[list[float]]

Embedding vectors aligned with the input chunk order.

build_document_node

build_document_node(document, embedding)

Build a document node with an optional embedding attached.

Parameters:

Name Type Description Default
document Document

Source document to convert into a persisted node model.

required
embedding list[float]

Optional embedding vector for the document. The ingestion path now passes an empty list, but the parameter is retained for API compatibility.

required

Returns:

Type Description
DocumentNode

Prepared document node ready for persistence.

build_chunk_nodes

build_chunk_nodes(chunks, embeddings)

Build chunk nodes with embeddings attached.

Parameters:

Name Type Description Default
chunks list[Chunk]

Source chunks to convert into persisted node models.

required
embeddings list[list[float]]

Embedding vectors aligned with chunks.

required

Returns:

Type Description
list[ChunkNode]

Prepared chunk nodes ready for persistence.

Raises:

Type Description
ValueError

Raised when the number of chunks and embeddings does not match.
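The alignment requirement between chunks and embeddings amounts to a length check before pairing. A minimal sketch with stand-in types follows; ChunkNode here is a hypothetical dataclass, chunks are modeled as plain strings, and pair_chunks_with_embeddings is not the library function.

```python
from dataclasses import dataclass


@dataclass
class ChunkNode:
    """Hypothetical stand-in for the persisted chunk node model."""
    content: str
    embedding: list[float]


def pair_chunks_with_embeddings(
    chunks: list[str], embeddings: list[list[float]]
) -> list[ChunkNode]:
    """Attach embeddings[i] to chunks[i]; reject mismatched lengths."""
    if len(chunks) != len(embeddings):
        raise ValueError(
            f"Expected one embedding per chunk, got {len(chunks)} chunks "
            f"and {len(embeddings)} embeddings"
        )
    return [
        ChunkNode(content=chunk, embedding=vector)
        for chunk, vector in zip(chunks, embeddings)
    ]
```

Checking lengths up front matters because zip would otherwise silently drop the unmatched tail.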

persist_document_and_chunks async

persist_document_and_chunks(document_node, chunk_nodes)

Persist one document node and its chunk nodes with indexes.

Parameters:

Name Type Description Default
document_node DocumentNode

Prepared document node.

required
chunk_nodes list[ChunkNode]

Prepared chunk nodes associated with the document.

required

extract_kg_per_chunk async

extract_kg_per_chunk(chunks, *, show_progress=False)

Extract knowledge graphs for chunks with bounded concurrency.

Parameters:

Name Type Description Default
chunks list[Chunk]

Chunks to analyze.

required
show_progress bool

When True, emit info-level log messages as chunk extraction starts, each chunk finishes, and the overall extraction completes. Defaults to False.

False

Returns:

Type Description
dict[str, KnowledgeGraph]

Extracted graphs keyed by chunk identifier.
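Bounded concurrency of this kind is commonly implemented with an asyncio.Semaphore. The sketch below shows that general pattern and is not taken from the GraWiki source. extract_bounded and extract_one are hypothetical names, and the per-chunk result is modeled as a string rather than a KnowledgeGraph.

```python
import asyncio
from typing import Awaitable, Callable


async def extract_bounded(
    chunk_ids: list[str],
    extract_one: Callable[[str], Awaitable[str]],
    max_workers: int = 4,
) -> dict[str, str]:
    """Run extract_one for every chunk with at most max_workers in flight."""
    semaphore = asyncio.Semaphore(max_workers)

    async def guarded(chunk_id: str) -> tuple[str, str]:
        # Each coroutine waits here until one of the slots is free.
        async with semaphore:
            return chunk_id, await extract_one(chunk_id)

    # gather preserves input order, so the dict keeps chunk order too.
    pairs = await asyncio.gather(*(guarded(cid) for cid in chunk_ids))
    return dict(pairs)
```

All coroutines are created up front, but only max_workers of them run the extraction call at any moment, which caps concurrent requests to the chat model.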

persist_entities_and_relationships async

persist_entities_and_relationships(owner_ids, owner_graphs)

Persist extracted entities and relationships.

Parameters:

Name Type Description Default
owner_ids Sequence[str]

Node identifiers that own the extracted graphs.

required
owner_graphs dict[str, KnowledgeGraph]

Extracted graphs keyed by owner identifier.

required

find_similar_entities async

find_similar_entities(entity, *, limit=10, threshold=None, candidates=None)

Return candidate entities similar to entity.

Parameters:

Name Type Description Default
entity Node

Source entity used as the similarity query.

required
limit int

Maximum number of candidate hits to return.

10
threshold float | None

Optional strategy-specific minimum score.

None
candidates list[Node] | None

Optional candidate pool. When omitted, persisted entities are loaded from the graph database.

None

Returns:

Type Description
list[NodeHit]

Ranked similarity candidates.

Notes

The configured grawiki.similarity.similarity_finder.EntitySimilarityFinder decides which concrete matcher implementation is used.

find_entity_collision_candidates async

find_entity_collision_candidates(*, limit=10, threshold=None)

Return semantic-key collision groups annotated with merge candidates.

Parameters:

Name Type Description Default
limit int

Maximum number of candidate hits returned per source entity.

10
threshold float | None

Optional strategy-specific minimum score.

None

Returns:

Type Description
list[SemanticKeyCollisionCandidates]

Collision groups with per-entity candidate matches.

Notes

Candidate generation uses the similarity matcher configured on the injected entity similarity finder.

find_entity_duplicate_candidates async

find_entity_duplicate_candidates(*, limit=10, threshold=None, skip_semantic_key_collisions_in_similarity_scan=True)

Run the two-step duplicate-finding heuristic across entities.

Parameters:

Name Type Description Default
limit int

Maximum number of candidate hits returned per source entity.

10
threshold float | None

Optional matcher-specific minimum score.

None
skip_semantic_key_collisions_in_similarity_scan bool

Whether the broader similarity scan should exclude entities already involved in exact semantic-key collisions.

True

Returns:

Type Description
EntityDuplicateCandidates

Combined duplicate-candidate report produced by the injected entity similarity finder.
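The two steps can be pictured as an exact semantic-key grouping followed by a broader similarity scan. The sketch below is an interpretation of that description, not the finder's actual code. two_step_duplicates and find_similar are hypothetical, entities are reduced to an id-to-semantic-key mapping, and excluding collided entities as scan sources (rather than as candidates) is an assumption.

```python
from collections import defaultdict
from typing import Callable, Mapping


def two_step_duplicates(
    entities: Mapping[str, str],
    find_similar: Callable[[str], list[str]],
    skip_collisions_in_scan: bool = True,
) -> tuple[dict[str, list[str]], dict[str, list[str]]]:
    """Return (exact key collisions, similarity hits per scanned entity)."""
    # Step 1: entities sharing one semantic key form a collision group.
    by_key: dict[str, list[str]] = defaultdict(list)
    for entity_id, semantic_key in entities.items():
        by_key[semantic_key].append(entity_id)
    collisions = {key: ids for key, ids in by_key.items() if len(ids) > 1}
    collided = {eid for ids in collisions.values() for eid in ids}

    # Step 2: similarity scan, optionally skipping already-collided sources.
    scan: dict[str, list[str]] = {}
    for entity_id in entities:
        if skip_collisions_in_scan and entity_id in collided:
            continue
        hits = find_similar(entity_id)
        if hits:
            scan[entity_id] = hits
    return collisions, scan
```

Skipping collided entities in the second step avoids reporting the same pair twice and saves similarity lookups for entities that are already flagged.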

dedupe_entities async

dedupe_entities(*, limit=10, threshold=None, min_merge_score=0.95, dry_run=False)

Find duplicate entities and merge them into canonical masters.

Parameters:

Name Type Description Default
limit int

Maximum candidate hits returned per source entity during duplicate inspection.

10
threshold float | None

Optional similarity threshold forwarded to the duplicate finder.

None
min_merge_score float

Minimum candidate score required for inclusion in a merge group.

0.95
dry_run bool

When True, reports are produced without applying destructive DB changes.

False

Returns:

Type Description
list[MergeReport]

Reports describing the merge decisions that were made.
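Because merges are destructive, a cautious workflow is to run with dry_run=True first and apply the merge only after reviewing the reports. The sketch below uses only the dedupe_entities parameters documented above. preview_then_merge and its max_groups guard are hypothetical, and rag is assumed to be a configured GraphRAG instance.

```python
from typing import Any


async def preview_then_merge(
    rag: Any, *, min_merge_score: float = 0.95, max_groups: int = 20
) -> list:
    """Dry-run first; merge for real only if the preview looks manageable."""
    preview = await rag.dedupe_entities(
        min_merge_score=min_merge_score, dry_run=True
    )
    # Nothing to merge, or too much to apply without manual review.
    if not preview or len(preview) > max_groups:
        return preview
    return await rag.dedupe_entities(
        min_merge_score=min_merge_score, dry_run=False
    )
```

When the preview exceeds max_groups, the function returns the dry-run reports untouched so they can be inspected before any database change is made.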

ingest async

ingest(path, *, show_progress=False)

Run the full ingestion flow for one file.

Parameters:

Name Type Description Default
path Path

Source file to ingest.

required
show_progress bool

When True, emit info-level progress logs for chunk-level knowledge-graph extraction during ingestion. Defaults to False.

False

Returns:

Type Description
None

Nothing is returned. The resulting graph is persisted to the configured database as a side effect.

ingest_text async

ingest_text(text, title, *, format='text', metadata=None, show_progress=False)

Ingest a document supplied as a string.

Parameters:

Name Type Description Default
text str

Document content to ingest.

required
title str

Human-readable document title used as the document name.

required
format ('text', 'markdown')

Explicit content format for the in-memory document. Defaults to "text" and is not auto-detected.

"text"
metadata dict[str, str] | None

Additional metadata attached to the transient source document before persistence.

None
show_progress bool

When True, emit info-level progress logs for chunk-level knowledge-graph extraction during ingestion. Defaults to False.

False

Returns:

Type Description
None

Nothing is returned. The resulting graph is persisted to the configured database as a side effect.

remember async

remember(memory, *, memory_id=None, name=None, semantic_key=None, metadata=None, related_node_ids=())

Persist one memory, replacing an existing memory when requested.

Parameters:

Name Type Description Default
memory MemoryNode | str

Memory payload to persist. Raw strings are normalized into a new grawiki.graph.models.MemoryNode.

required
memory_id str | None

Existing memory identifier to replace. When omitted, memory.id is used as-is.

None
name str | None

Optional memory name override. Primarily useful when memory is a raw string.

None
semantic_key str | None

Optional semantic key override. Defaults to the final memory id.

None
metadata dict[str, str] | None

Optional metadata merged into the memory metadata.

None
related_node_ids Sequence[str]

Existing node ids that should be explicitly linked from the memory.

()

Returns:

Type Description
MemoryNode

Persisted memory payload including its final id.
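A typical create-then-replace flow might look like the sketch below, which relies only on the remember parameters documented above and on the returned node exposing an id attribute. remember_then_revise and the "user-preference" name are hypothetical, and rag is assumed to be a configured GraphRAG instance.

```python
from typing import Any, Sequence


async def remember_then_revise(
    rag: Any,
    first_text: str,
    revised_text: str,
    related_node_ids: Sequence[str] = (),
) -> tuple[str, str]:
    """Store a memory from a raw string, then replace it using its id."""
    saved = await rag.remember(
        first_text,
        name="user-preference",
        related_node_ids=related_node_ids,
    )
    # Passing memory_id asks remember to replace the existing memory.
    revised = await rag.remember(
        revised_text,
        memory_id=saved.id,
        name="user-preference",
        related_node_ids=related_node_ids,
    )
    return saved.id, revised.id
```

Returning both ids makes it easy to confirm whether the replacement kept the original identifier in a given setup.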

search async

search(query, *, limit=10)

Aggregate results from the configured retrievers.

Parameters:

Name Type Description Default
query str

Raw user query text.

required
limit int

Maximum number of final hits returned after combining retriever outputs.

10

Returns:

Type Description
list[NodeHit]

Flat, deduplicated search hits across the configured retrievers. With the default retriever set this typically includes chunk, memory, and keyword-expanded entity results.

Raises:

Type Description
RuntimeError

Raised when every configured retriever fails for the query.
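The aggregation behavior described for search (flat, deduplicated hits, with an error only when every retriever fails) can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: Hit and aggregate are hypothetical, retrievers are modeled as async callables, duplicates are resolved by keeping the highest score per node id, and the final ranking is by score.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a search hit."""
    node_id: str
    score: float


Retriever = Callable[[str], Awaitable[list[Hit]]]


async def aggregate(
    query: str, retrievers: list[Retriever], limit: int = 10
) -> list[Hit]:
    """Merge retriever outputs, tolerating partial failures."""
    results = await asyncio.gather(
        *(retriever(query) for retriever in retrievers),
        return_exceptions=True,
    )
    successes = [r for r in results if not isinstance(r, BaseException)]
    if retrievers and not successes:
        raise RuntimeError(f"Every retriever failed for query {query!r}")

    # Deduplicate by node id, keeping the best-scoring hit for each node.
    best: dict[str, Hit] = {}
    for hits in successes:
        for hit in hits:
            current = best.get(hit.node_id)
            if current is None or hit.score > current.score:
                best[hit.node_id] = hit
    ranked = sorted(best.values(), key=lambda h: h.score, reverse=True)
    return ranked[:limit]
```

With return_exceptions=True, one failing retriever does not cancel the others, so a partial result is still returned as long as at least one retriever succeeds.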

recall async

recall(query, *, user_id=None, limit=5, hops=1, limit_per_hop=5)

Search memories and attach connected graph context.

Parameters:

Name Type Description Default
query str

Raw user query text.

required
user_id str | None

Optional memory-owner filter applied after memory retrieval.

None
limit int

Maximum number of memories returned.

5
hops int

Number of graph-expansion hops to include.

1
limit_per_hop int

Maximum recall paths expanded per memory seed.

5