GraphRAG¶

GraphRAG is the main public facade. It combines document ingestion, chunk-level graph extraction, retrieval, agent-memory persistence, and duplicate-entity inspection in one class.

GraphRAG always keeps the generic text chunker ready and can optionally add a markdown-aware pipeline path:

the generic text Chunker for plain-text content,
an explicit markdown pipeline adapter for markdown content when markdown_pipeline= is provided.

read_document(...) is the only file-format detection step. It marks .md and .markdown files as markdown content, converts .pdf files to markdown in memory, and leaves other files as plain text. ingest_text(...) does not auto-detect content; callers choose format="text" or format="markdown" explicitly. Both ingest(...) and ingest_text(...) then run the same private ingestion flow for chunking, optional process_chunks(...), embedding, persistence, and extraction.
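The extension rule described above can be sketched as a small standalone function. This is an illustration only: `detect_content_format` is a hypothetical helper, not part of grawiki, and the real `read_document(...)` also loads the file and converts PDFs, which is not shown. Case-insensitive suffix matching and labelling converted PDFs as markdown are assumptions.

```python
from pathlib import Path

# Hypothetical helper that mirrors the extension rule described in the text.
MARKDOWN_SUFFIXES = {".md", ".markdown"}


def detect_content_format(path: Path) -> str:
    """Return "markdown" or "text" based only on the file extension."""
    suffix = path.suffix.lower()  # assumption: matching ignores case
    if suffix in MARKDOWN_SUFFIXES:
        return "markdown"
    if suffix == ".pdf":
        # PDFs are converted to markdown in memory, so this sketch assumes
        # the resulting content is treated as markdown.
        return "markdown"
    return "text"  # every other file is left as plain text
```

With `ingest_text(...)` there is no such step: the caller passes `format="text"` or `format="markdown"` directly.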

When you pass chunk_processors= to GraphRAG(...), those processors run after chunking and before chunk embeddings and chunk-level graph extraction. GraphRAG preserves the declared processor-stage order, but chunks within one stage are processed concurrently up to num_workers= while the returned chunk order stays stable. The same num_workers= cap also bounds extract_kg_per_chunk(...).
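One way to get "stages in order, chunks concurrent, output order stable" is a semaphore combined with `asyncio.gather`. The sketch below is not the grawiki implementation; `run_stages` and the two sample processors are hypothetical names, and chunks are modelled as plain strings to keep the example self-contained.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

Processor = Callable[[str], Awaitable[str]]


async def run_stages(
    chunks: Sequence[str], stages: Sequence[Processor], num_workers: int
) -> list[str]:
    """Apply each stage to all chunks before the next stage starts."""
    current = list(chunks)
    for stage in stages:  # stages run strictly one after another
        semaphore = asyncio.Semaphore(num_workers)

        async def bounded(chunk: str, stage=stage, semaphore=semaphore) -> str:
            async with semaphore:  # at most num_workers chunks in flight
                return await stage(chunk)

        # gather returns results in input order, whatever the finish order was.
        current = list(await asyncio.gather(*(bounded(c) for c in current)))
    return current


# Two sample processors, used only for illustration.
async def strip_whitespace(chunk: str) -> str:
    return chunk.strip()


async def uppercase(chunk: str) -> str:
    # Shorter chunks sleep longer, so completion order differs from input order.
    await asyncio.sleep(0.01 / max(len(chunk), 1))
    return chunk.upper()
```

Calling `asyncio.run(run_stages(chunks, [strip_whitespace, uppercase], num_workers=2))` runs `strip_whitespace` over every chunk first, then `uppercase`, and returns the results aligned with the input list.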

In the default ingestion flow, vector embeddings are created for chunks, entities, memories, and queries, while document nodes are persisted without document-level vectors.

The main entry points are:

ingest(...) and ingest_text(...) for document ingestion.
search(...) and recall(...) for query-time retrieval.
remember(...) for writing memory nodes.
find_entity_duplicate_candidates(...) and dedupe_entities(...) for duplicate inspection and merge execution.

When resolve_entities_on_ingest=True, the same similarity infrastructure used for duplicate inspection is also applied during ingestion. Extracted entities can then be matched to persisted entities before new nodes are written.

The concurrent chunk-processing logic lives in a dedicated ChunkWorkerPool class in grawiki.rag.chunk_workers. GraphRAG(...) accepts the constructor parameters that govern worker behavior (num_workers=, chunk_worker_timeout=, chunk_max_retries=, retry_base_delay=, and skip_failed_chunks=) and forwards them to the internal pool; they are not stored as public attributes on GraphRAG itself.
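To make the roles of the timeout, retry, and skip parameters concrete, here is a minimal sketch of how one unit of chunk work could be wrapped. It is not the ChunkWorkerPool code: `run_with_retries` and `make_flaky` are hypothetical, and the exponential backoff schedule is an assumption, since the text only names a base delay.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_with_retries(
    work: Callable[[], Awaitable[str]],
    *,
    timeout: float,      # plays the role of chunk_worker_timeout=
    max_retries: int,    # plays the role of chunk_max_retries=
    base_delay: float,   # plays the role of retry_base_delay=
    skip_failed: bool,   # plays the role of skip_failed_chunks=
) -> Optional[str]:
    """Run one unit of work with a per-attempt timeout and retries."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(work(), timeout=timeout)
        except Exception:
            if attempt == max_retries:
                if skip_failed:
                    return None  # caller drops this chunk and continues
                raise
            # Assumption: exponential backoff; the real schedule may differ.
            await asyncio.sleep(base_delay * (2 ** attempt))
    return None  # only reached when max_retries is negative


def make_flaky(failures: int) -> Callable[[], Awaitable[str]]:
    """Build a demo task that fails `failures` times, then succeeds."""
    state = {"left": failures}

    async def work() -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("transient failure")
        return "ok"

    return work
```

A task that fails twice succeeds on the third attempt when `max_retries=2`; with `skip_failed=True`, a task that never succeeds yields `None` instead of raising.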

Post-persistence entity deduplication keeps the same facade methods, but the duplicate grouping, master selection, and merge execution logic is delegated to EntityDeduplicator in grawiki.rag.entity_deduplication.

For end-to-end examples, see Flows and the task-oriented guides in How to.

grawiki.rag.graph_rag.GraphRAG ¶

Orchestrate document ingestion and retrieval-augmented search.

GraphRAG is the main facade for the grawiki pipeline. It wires together chunking, embedding, knowledge-graph extraction, entity resolution, and vector/keyword retrieval into a coherent workflow:

Ingest: read a document from disk or a string.
Extract: chunk the document and extract a knowledge graph per chunk.
Resolve: optionally deduplicate extracted entities against the persisted graph.
Persist: store document nodes, chunk nodes, entities, and relationships in the configured graph database.
Query: search the graph with search(...) or recall(...).

The stepwise helpers useful for notebooks and debugging are read_document(...), chunk_document(...), process_chunks(...), embed_chunks(...), build_document_node(...), build_chunk_nodes(...), persist_document_and_chunks(...), extract_kg_per_chunk(...), and persist_entities_and_relationships(...).

See __init__ for the full list of constructor parameters.
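For notebook use, the stepwise helpers can be chained by hand. The sketch below assumes `rag` is an already configured GraphRAG instance and uses only the method names and signatures documented on this page. `ingest_stepwise` itself is a hypothetical wrapper, and using the chunk identifiers that key the extracted graphs as `owner_ids` is an assumption.

```python
from pathlib import Path
from typing import Any


async def ingest_stepwise(rag: Any, path: Path) -> dict:
    """Call the documented stepwise helpers in ingestion order."""
    document = rag.read_document(path)                  # sync
    chunks = rag.chunk_document(document)               # sync
    chunks = await rag.process_chunks(chunks)           # optional processors
    embeddings = await rag.embed_chunks(chunks)
    document_node = rag.build_document_node(document)   # sync
    chunk_nodes = rag.build_chunk_nodes(chunks, embeddings)
    await rag.persist_document_and_chunks(document_node, chunk_nodes)
    graphs = await rag.extract_kg_per_chunk(chunks)
    # Assumption: the chunk identifiers keying `graphs` are the owner ids.
    await rag.persist_entities_and_relationships(list(graphs), graphs)
    return graphs


# Usage, given a configured instance:
# asyncio.run(ingest_stepwise(rag, Path("notes.md")))
```

Running the steps one at a time makes it possible to inspect chunks, embeddings, or extracted graphs between calls, which `ingest(...)` does not expose.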

read_document ¶

read_document(path)

Load one source document from disk.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Filesystem path to the source document.	required

Returns:

Type	Description
`Document`	Loaded source document.

chunk_document ¶

chunk_document(document, format=None)

Split a document into chunks.

Parameters:

Name	Type	Description	Default
`document`	`Document`	Source document to segment.	required
`format`	`('text', 'markdown')`	Explicit content-format override. When omitted, the method uses `document.metadata["content_format"]` and falls back to `"text"`.	`None`

Returns:

Type	Description
`list[Chunk]`	Chunk sequence produced by the configured chunker. Markdown content uses the markdown pipeline only when one was configured explicitly.
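The format resolution described for `chunk_document(...)` follows a simple precedence: explicit argument, then document metadata, then `"text"`. The sketch below illustrates that precedence with a hypothetical `resolve_chunking_path` function. The return labels are invented for the example, and falling back to the generic chunker for markdown content when no markdown pipeline is configured is an assumption drawn from the description above.

```python
from typing import Mapping, Optional


def resolve_chunking_path(
    metadata: Mapping[str, str],
    format: Optional[str] = None,
    has_markdown_pipeline: bool = False,
) -> str:
    """Pick which chunking path would handle a document."""
    # Precedence: explicit override, then metadata, then "text".
    if format is not None:
        content_format = format
    else:
        content_format = metadata.get("content_format", "text")

    # Markdown content uses the markdown pipeline only when one is configured.
    if content_format == "markdown" and has_markdown_pipeline:
        return "markdown_pipeline"
    return "text_chunker"
```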

process_chunks `async` ¶

process_chunks(chunks)

Apply configured chunk processors stage-by-stage.

Parameters:

Name	Type	Description	Default
`chunks`	`list[Chunk]`	Chunks to process.	required

Returns:

Type	Description
`list[Chunk]`	Processed chunks in the same order as the input sequence. Processor stages run in configured order, while chunks within one stage run concurrently up to `num_workers`.

embed_chunks `async` ¶

embed_chunks(chunks)

Embed chunk contents in one batch.

Parameters:

Name	Type	Description	Default
`chunks`	`list[Chunk]`	Chunks whose content should be embedded.	required

Returns:

Type	Description
`list[list[float]]`	Embedding vectors aligned with the input chunk order.

build_document_node ¶

build_document_node(document)

Build a document node ready for persistence.

Parameters:

Name	Type	Description	Default
`document`	`Document`	Source document to convert into a persisted node model.	required

Returns:

Type	Description
`DocumentNode`	Prepared document node ready for persistence. Document-level embeddings are not stored; the embedding field is always empty.

build_chunk_nodes ¶

build_chunk_nodes(chunks, embeddings)

Build chunk nodes with embeddings attached.

Parameters:

Name	Type	Description	Default
`chunks`	`list[Chunk]`	Source chunks to convert into persisted node models.	required
`embeddings`	`list[list[float]]`	Embedding vectors aligned with `chunks`.	required

Returns:

Type	Description
`list[ChunkNode]`	Prepared chunk nodes ready for persistence.

Raises:

Type	Description
`ValueError`	Raised when the number of chunks and embeddings does not match.
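The alignment requirement can be illustrated with a small stand-in. `SketchChunkNode` and `pair_chunks_with_embeddings` are hypothetical and use plain strings for chunks; the real method works with `Chunk` and `ChunkNode` models whose fields are not documented on this page.

```python
from dataclasses import dataclass


@dataclass
class SketchChunkNode:  # hypothetical stand-in for ChunkNode
    content: str
    embedding: list[float]


def pair_chunks_with_embeddings(
    chunks: list[str], embeddings: list[list[float]]
) -> list[SketchChunkNode]:
    """Attach the i-th embedding to the i-th chunk."""
    if len(chunks) != len(embeddings):
        # Mirrors the documented ValueError on a count mismatch.
        raise ValueError(
            f"got {len(chunks)} chunks but {len(embeddings)} embeddings"
        )
    return [SketchChunkNode(c, e) for c, e in zip(chunks, embeddings)]
```

Checking the lengths up front matters because a plain `zip` would silently drop the unmatched tail instead of failing.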

persist_document_and_chunks `async` ¶

persist_document_and_chunks(document_node, chunk_nodes)

Persist one document node and its chunk nodes with indexes.

Parameters:

Name	Type	Description	Default
`document_node`	`DocumentNode`	Prepared document node.	required
`chunk_nodes`	`list[ChunkNode]`	Prepared chunk nodes associated with the document.	required

extract_kg_per_chunk `async` ¶

extract_kg_per_chunk(chunks, *, show_progress=False)

Extract knowledge graphs for chunks with bounded concurrency.

Parameters:

Name	Type	Description	Default
`chunks`	`list[Chunk]`	Chunks to analyze.	required
`show_progress`	`bool`	When `True`, emit info-level log messages as chunk extraction starts, each chunk finishes, and the overall extraction completes.	`False`

Returns:

Type	Description
`dict[str, KnowledgeGraph]`	Extracted graphs keyed by chunk identifier.

aclose `async` ¶

aclose()

Best-effort close of facade-owned resources.

Notes

This forwards cleanup to initialized chunk processors, the configured knowledge-graph extractor, and the graph database adapter when those objects expose aclose() or close().
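A best-effort close of this kind can be sketched as follows. This is not the grawiki implementation: `close_best_effort` is a hypothetical helper, and both preferring `aclose()` over `close()` and logging rather than re-raising failures are assumptions about what "best-effort" means here.

```python
import inspect
import logging
from typing import Any, Iterable

logger = logging.getLogger(__name__)


async def close_best_effort(resources: Iterable[Any]) -> None:
    """Close every resource that exposes aclose() or close()."""
    for resource in resources:
        # Assumption: aclose() wins when both methods exist.
        closer = getattr(resource, "aclose", None) or getattr(resource, "close", None)
        if closer is None:
            continue  # nothing to close on this object
        try:
            result = closer()
            if inspect.isawaitable(result):
                await result  # async closers return an awaitable
        except Exception:
            # One failing resource must not prevent closing the others.
            logger.warning("Failed to close %r", resource, exc_info=True)
```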

persist_entities_and_relationships `async` ¶

persist_entities_and_relationships(owner_ids, owner_graphs)

Persist extracted entities and relationships.

Parameters:

Name	Type	Description	Default
`owner_ids`	`Sequence[str]`	Node identifiers that own the extracted graphs.	required
`owner_graphs`	`dict[str, KnowledgeGraph]`	Extracted graphs keyed by owner identifier.	required

find_entity_duplicate_candidates `async` ¶

find_entity_duplicate_candidates(*, limit=10, threshold=None, skip_semantic_key_collisions_in_similarity_scan=True)

Run the two-step duplicate-finding heuristic across entities.

Parameters:

Name	Type	Description	Default
`limit`	`int`	Maximum number of candidate hits returned per source entity.	`10`
`threshold`	`float \| None`	Optional matcher-specific minimum score.	`None`
`skip_semantic_key_collisions_in_similarity_scan`	`bool`	Whether the broader similarity scan should exclude entities already involved in exact semantic-key collisions.	`True`

Returns:

Type	Description
`EntityDuplicateCandidates`	Combined duplicate-candidate report produced by the injected entity similarity finder.

dedupe_entities `async` ¶

dedupe_entities(*, limit=10, threshold=None, min_merge_score=0.95, dry_run=False)

Find duplicate entities and merge them into canonical masters.

Parameters:

Name	Type	Description	Default
`limit`	`int`	Maximum candidate hits returned per source entity during duplicate inspection.	`10`
`threshold`	`float \| None`	Optional similarity threshold forwarded to the duplicate finder.	`None`
`min_merge_score`	`float`	Minimum candidate score required for inclusion in a merge group.	`0.95`
`dry_run`	`bool`	When `True`, reports are produced without applying destructive DB changes.	`False`

Returns:

Type	Description
`list[MergeReport]`	Reports describing the merge decisions that were made.
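Because merges are destructive, a cautious workflow is to run with `dry_run=True` first and apply the merge only if the preview looks reasonable. The sketch below assumes `rag` is a configured GraphRAG instance. `preview_then_merge` and its `max_groups` guard are hypothetical and not part of the library.

```python
from typing import Any


async def preview_then_merge(
    rag: Any, *, min_merge_score: float = 0.95, max_groups: int = 50
) -> tuple[list, bool]:
    """Preview merges with dry_run=True, then apply them if the count is sane.

    Returns (reports, applied).
    """
    preview = await rag.dedupe_entities(
        min_merge_score=min_merge_score, dry_run=True
    )
    if len(preview) > max_groups:
        # Hypothetical guard: too many merges, leave them for manual review.
        return preview, False
    reports = await rag.dedupe_entities(
        min_merge_score=min_merge_score, dry_run=False
    )
    return reports, True
```

Raising `min_merge_score` narrows the merge groups; `threshold` is a separate knob that is forwarded to the duplicate finder.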

ingest `async` ¶

ingest(path, *, show_progress=False)

Run the full ingestion flow for one file.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Source file to ingest.	required
`show_progress`	`bool`	When `True`, emit info-level progress logs for chunk-level knowledge-graph extraction during ingestion.	`False`

Returns:

Type	Description
`None`	This method persists the resulting graph side effects to the configured database.

ingest_text `async` ¶

ingest_text(text, title, *, format='text', metadata=None, show_progress=False)

Ingest a document supplied as a string.

Parameters:

Name	Type	Description	Default
`text`	`str`	Document content to ingest.	required
`title`	`str`	Human-readable document title used as the document name.	required
`format`	`('text', 'markdown')`	Explicit content format for the in-memory document. Defaults to `"text"` and is not auto-detected.	`"text"`
`metadata`	`dict[str, str] \| None`	Additional metadata attached to the transient source document before persistence.	`None`
`show_progress`	`bool`	When `True`, emit info-level progress logs for chunk-level knowledge-graph extraction during ingestion.	`False`

Returns:

Type	Description
`None`	This method persists the resulting graph side effects to the configured database.

remember `async` ¶

remember(memory, *, memory_id=None, name=None, semantic_key=None, metadata=None, related_node_ids=())

Persist one memory, replacing an existing memory when requested.

Parameters:

Name	Type	Description	Default
`memory`	`MemoryNode \| str`	Memory payload to persist. Raw strings are normalized into a new `grawiki.graph.models.MemoryNode`.	required
`memory_id`	`str \| None`	Existing memory identifier to replace. When omitted, `memory.id` is used as-is.	`None`
`name`	`str \| None`	Optional memory name override. Primarily useful when `memory` is a raw string.	`None`
`semantic_key`	`str \| None`	Optional semantic key override. Defaults to the final memory id.	`None`
`metadata`	`dict[str, str] \| None`	Optional metadata merged into the memory metadata.	`None`
`related_node_ids`	`Sequence[str]`	Existing node ids that should be explicitly linked from the memory.	`()`

Returns:

Type	Description
`MemoryNode`	Persisted memory payload including its final id.

search `async` ¶

search(query, *, limit=10)

Aggregate results from the configured retrievers.

Parameters:

Name	Type	Description	Default
`query`	`str`	Raw user query text.	required
`limit`	`int`	Maximum number of final hits returned after combining retriever outputs.	`10`

Returns:

Type	Description
`list[NodeHit]`	Flat, deduplicated search hits across the configured retrievers. With the default retriever set this typically includes chunk, memory, and keyword-expanded entity results.

Raises:

Type	Description
`RuntimeError`	Raised when every configured retriever fails for the query.
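The aggregation contract (flat, deduplicated hits, with an error only when every retriever fails) can be sketched with plain asyncio. Nothing here is the grawiki implementation: `SketchHit`, `aggregate`, and the sample retrievers are hypothetical, and deduplicating by node id while keeping the highest score is an assumed merge policy.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Sequence


@dataclass(frozen=True)
class SketchHit:  # hypothetical stand-in for NodeHit
    node_id: str
    score: float


Retriever = Callable[[str], Awaitable[list[SketchHit]]]


async def aggregate(
    query: str, retrievers: Sequence[Retriever], limit: int = 10
) -> list[SketchHit]:
    """Run all retrievers, tolerate partial failure, and merge the hits."""
    results = await asyncio.gather(
        *(retriever(query) for retriever in retrievers), return_exceptions=True
    )
    failures = [r for r in results if isinstance(r, BaseException)]
    if retrievers and len(failures) == len(retrievers):
        raise RuntimeError("every retriever failed") from failures[0]

    best: dict[str, SketchHit] = {}
    for result in results:
        if isinstance(result, BaseException):
            continue  # skip failed retrievers, keep the rest
        for hit in result:
            kept = best.get(hit.node_id)
            if kept is None or hit.score > kept.score:
                best[hit.node_id] = hit  # assumption: highest score wins
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:limit]


# Sample retrievers, used only for illustration.
async def chunk_retriever(query: str) -> list[SketchHit]:
    return [SketchHit("c1", 0.9), SketchHit("e1", 0.4)]


async def entity_retriever(query: str) -> list[SketchHit]:
    return [SketchHit("e1", 0.7)]


async def broken_retriever(query: str) -> list[SketchHit]:
    raise ConnectionError("index unavailable")
```

With these three retrievers, one failure is tolerated and `e1` appears once in the merged list; passing only `broken_retriever` raises `RuntimeError`.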

recall `async` ¶

recall(query, *, user_id=None, limit=5, hops=1, limit_per_hop=5)

Search memories and attach connected graph context.

Parameters:

Name	Type	Description	Default
`query`	`str`	Raw user query text.	required
`user_id`	`str \| None`	Optional memory-owner filter applied after memory retrieval.	`None`
`limit`	`int`	Maximum number of memories returned.	`5`
`hops`	`int`	Number of graph-expansion hops to include.	`1`
`limit_per_hop`	`int`	Maximum recall paths expanded per memory seed.	`5`
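To give a feel for how `hops` and `limit_per_hop` interact, the sketch below expands paths from one memory seed over an in-memory adjacency map. It is a toy model, not the library's traversal: `expand_paths` is hypothetical, and both the cycle handling and the way the cap is applied after each hop are assumptions.

```python
from typing import Mapping, Sequence


def expand_paths(
    graph: Mapping[str, Sequence[str]],
    seed: str,
    hops: int = 1,
    limit_per_hop: int = 5,
) -> list[list[str]]:
    """Return paths that start at `seed` and are at most `hops` edges long."""
    frontier: list[list[str]] = [[seed]]
    collected: list[list[str]] = []
    for _ in range(hops):
        next_frontier: list[list[str]] = []
        for path in frontier:
            for neighbor in graph.get(path[-1], ()):
                if neighbor in path:
                    continue  # do not walk back into the same path
                next_frontier.append(path + [neighbor])
        # Assumption: the cap applies to the paths kept after each hop.
        next_frontier = next_frontier[:limit_per_hop]
        collected.extend(next_frontier)
        frontier = next_frontier
    return collected
```

Under these assumptions, raising `hops` deepens the context attached to each memory, while `limit_per_hop` keeps the number of expanded paths bounded.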
