GraphRAG
GraphRAG is the main public facade. It combines document ingestion, chunk-level graph extraction, retrieval, agent-memory persistence, and duplicate-entity inspection in one class.
GraphRAG always keeps the generic text chunker ready and can optionally add a markdown-aware pipeline path:
- the generic text Chunker for plain-text content,
- an explicit markdown pipeline adapter for markdown content when markdown_pipeline= is provided.
read_document(...) is the only file-format detection step. It marks .md and .markdown files as markdown content, converts .pdf files to markdown in memory, and leaves other files as plain text. ingest_text(...) does not auto-detect content; callers choose format="text" or format="markdown" explicitly. Both ingest(...) and ingest_text(...) then run the same private ingestion flow for chunking, optional process_chunks(...), embedding, persistence, and extraction.
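The detection rule described above can be sketched in a few lines. This is an illustration only, not the library's code: detect_format is a hypothetical helper, the case-insensitive suffix comparison is an assumption, and the real read_document(...) also loads the file and converts PDF content, which this sketch skips.

```python
from pathlib import Path

# Hypothetical helper mirroring the documented rule; not part of the library.
MARKDOWN_SUFFIXES = {".md", ".markdown"}


def detect_format(path: Path) -> str:
    """Return "markdown" or "text" based only on the file extension."""
    # Assumption: suffixes are compared case-insensitively.
    suffix = path.suffix.lower()
    if suffix in MARKDOWN_SUFFIXES:
        return "markdown"
    if suffix == ".pdf":
        # PDFs are described as being converted to markdown in memory,
        # so the resulting content is treated as markdown here.
        return "markdown"
    return "text"
```

Because ingest_text(...) performs no detection, a caller holding an in-memory string would pass format="markdown" explicitly rather than rely on a rule like this one.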
When you pass chunk_processors= to GraphRAG(...), those processors run after chunking and before chunk embeddings and chunk-level graph extraction.
In the default ingestion flow, vector embeddings are created for chunks, entities, memories, and queries, while document nodes are persisted without document-level vectors.
The main entry points are:
- ingest(...) and ingest_text(...) for document ingestion.
- search(...) and recall(...) for query-time retrieval.
- remember(...) for writing memory nodes.
- find_entity_duplicate_candidates(...) and dedupe_entities(...) for duplicate inspection and merge execution.
When resolve_entities_on_ingest=True, the same similarity infrastructure used for duplicate inspection is also applied during ingestion. Extracted entities can then be matched to persisted entities before new nodes are written.
For end-to-end examples, see Flows and the task-oriented guides in How to.
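As a quick orientation, the call sequence for the two most common entry points might look like the sketch below. It uses only the ingest(...) and search(...) signatures documented on this page. The rag argument is assumed to be an already constructed GraphRAG instance, and _StubRAG is a hypothetical stand-in that lets the snippet run without a model or database.

```python
import asyncio
from pathlib import Path


async def ingest_and_search(rag, path: Path, question: str):
    """Ingest one file, then query it, using only the documented entry points."""
    await rag.ingest(path, show_progress=False)
    return await rag.search(question, limit=5)


class _StubRAG:
    """Hypothetical stand-in; records calls instead of touching a database."""

    def __init__(self):
        self.calls = []

    async def ingest(self, path, *, show_progress=False):
        self.calls.append(("ingest", Path(path).name))

    async def search(self, query, *, limit=10):
        self.calls.append(("search", query, limit))
        return []


stub = _StubRAG()
hits = asyncio.run(ingest_and_search(stub, Path("notes.md"), "What is GraphRAG?"))
```

With a real instance, hits would be the list[NodeHit] described under search.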
grawiki.rag.graph_rag.GraphRAG
Orchestrate document ingestion and retrieval-augmented search.
The stepwise ingestion helpers exposed for notebooks and debugging are
read_document(...), chunk_document(...), process_chunks(...),
embed_document(...), embed_chunks(...), build_document_node(...),
build_chunk_nodes(...), persist_document_and_chunks(...),
extract_kg_per_chunk(...), and persist_entities_and_relationships(...).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str | Chat model used by the knowledge graph extractor. | required |
| embedding_model | str | Embedding model used for chunks, entities, memories, and queries. Document nodes are persisted without a vector. | required |
| db | GraphDB | Graph database adapter used for persistence and search. | required |
| chunking_strategy | str | Chunking strategy passed to :class: | 'sentence' |
| chunk_processors | list[ChunkProcessor] \| None | Optional chunk-level processing steps applied after chunking and before embedding and graph extraction. Useful for enrichment or normalization tasks such as question generation, entity anonymization, or metadata injection. | None |
| markdown_pipeline | Pipeline \| None | Optional markdown-aware pipeline used for markdown content. When omitted, markdown falls back to the generic text chunker. | None |
| max_workers | int | Maximum number of concurrent chunk-level extraction coroutines. | 4 |
| embedding | Embedding \| None | Embedding override for tests or debugging. | None |
| kg_extractor | KnowledgeGraphExtractorProtocol \| None | Knowledge graph extractor override for tests or debugging. | None |
| kg_output_language | str | Language used by the default knowledge graph extractor for entity names, relationship labels, and textual properties. Defaults to | 'English' |
| kg_extractor_kwargs | dict[str, Any] \| None | Extra keyword arguments forwarded to the default extractor's | None |
| similarity_finder | EntitySimilarityFinder \| None | Entity similarity finder used for collision inspection and candidate lookup. Defaults to a finder backed by the vector similarity matcher. | None |
| resolve_entities_on_ingest | bool | When | False |
| entity_resolution_threshold | float | Minimum cosine-similarity score for two entities to be considered the same during ingest-time resolution. Only used when | 0.92 |
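To make entity_resolution_threshold concrete, here is a minimal sketch of a cosine-similarity check against the documented default of 0.92. The helper names are hypothetical, and treating the threshold as inclusive (>=) is an assumption; the library's matcher may compare scores differently.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def is_same_entity(a: list[float], b: list[float], threshold: float = 0.92) -> bool:
    """Hypothetical helper: treat two embeddings as one entity at or above threshold."""
    # Assumption: the comparison is inclusive.
    return cosine_similarity(a, b) >= threshold
```

A higher threshold merges fewer entities at ingest time and leaves more work for a later dedupe_entities(...) pass; a lower one risks merging entities that are merely related.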
read_document
read_document(path)
Load one source document from disk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Filesystem path to the source document. | required |
Returns:
| Type | Description |
|---|---|
| Document | Loaded source document. |
chunk_document
chunk_document(document, format=None)
Split a document into chunks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document | Document | Source document to segment. | required |
| format | ('text', 'markdown') | Explicit content-format override. When omitted, the method uses | "text" |
Returns:
| Type | Description |
|---|---|
| list[Chunk] | Chunk sequence produced by the configured chunker. Markdown content uses the markdown pipeline only when one was configured explicitly. |
process_chunks
async
process_chunks(chunks)
Apply configured chunk processors in sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunks | list[Chunk] | Chunks to process. | required |
Returns:
| Type | Description |
|---|---|
| list[Chunk] | Processed chunks in the same order as the input sequence. |
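The "in sequence" behavior can be pictured as a simple chain in which each processor receives the previous processor's output. The sketch below is an assumption-heavy illustration: the Chunk dataclass is a hypothetical stand-in, and processors are modeled as plain async callables, which may not match the library's ChunkProcessor interface.

```python
import asyncio
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Chunk:
    """Hypothetical minimal chunk; the library's Chunk model may differ."""

    content: str
    metadata: dict = field(default_factory=dict)


async def strip_whitespace(chunks):
    """Example normalization step."""
    return [replace(c, content=c.content.strip()) for c in chunks]


async def tag_length(chunks):
    """Example metadata-injection step."""
    return [replace(c, metadata={**c.metadata, "length": len(c.content)}) for c in chunks]


async def run_processors(chunks, processors):
    """Feed the output of each processor into the next, preserving order."""
    for processor in processors:
        chunks = await processor(chunks)
    return chunks


result = asyncio.run(
    run_processors([Chunk("  hello "), Chunk("world")], [strip_whitespace, tag_length])
)
```

Order matters in a chain like this: tag_length sees the stripped content only because strip_whitespace runs first.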
embed_document
async
embed_document(document)
Return no document-level embedding for ingestion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document | Document | Source document. Kept in the signature for step-method API compatibility. | required |
Returns:
| Type | Description |
|---|---|
| list[float] | Empty list. Document content is persisted without a vector; chunk, entity, memory, and query embeddings remain the retrieval path. |
embed_chunks
async
embed_chunks(chunks)
Embed chunk contents in one batch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunks | list[Chunk] | Chunks whose content should be embedded. | required |
Returns:
| Type | Description |
|---|---|
| list[list[float]] | Embedding vectors aligned with the input chunk order. |
build_document_node
build_document_node(document, embedding)
Build a document node with an optional embedding attached.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document | Document | Source document to convert into a persisted node model. | required |
| embedding | list[float] | Optional embedding vector for the document. The ingestion path now passes an empty list, but the parameter is retained for API compatibility. | required |
Returns:
| Type | Description |
|---|---|
| DocumentNode | Prepared document node ready for persistence. |
build_chunk_nodes
build_chunk_nodes(chunks, embeddings)
Build chunk nodes with embeddings attached.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunks | list[Chunk] | Source chunks to convert into persisted node models. | required |
| embeddings | list[list[float]] | Embedding vectors aligned with | required |
Returns:
| Type | Description |
|---|---|
| list[ChunkNode] | Prepared chunk nodes ready for persistence. |
Raises:
| Type | Description |
|---|---|
| ValueError | Raised when the number of chunks and embeddings does not match. |
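The alignment requirement amounts to a length check before pairing. A minimal sketch, assuming a hypothetical ChunkNode stand-in and plain strings in place of the library's Chunk objects:

```python
from dataclasses import dataclass


@dataclass
class ChunkNode:
    """Hypothetical stand-in for the library's chunk node model."""

    content: str
    embedding: list[float]


def pair_chunks_with_embeddings(
    contents: list[str], embeddings: list[list[float]]
) -> list[ChunkNode]:
    """Attach the i-th embedding to the i-th chunk, refusing mismatched inputs."""
    if len(contents) != len(embeddings):
        raise ValueError(
            f"got {len(contents)} chunks but {len(embeddings)} embeddings"
        )
    return [ChunkNode(c, e) for c, e in zip(contents, embeddings)]
```

Checking lengths up front matters because zip(...) alone would silently drop the unmatched tail instead of reporting the mismatch.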
persist_document_and_chunks
async
persist_document_and_chunks(document_node, chunk_nodes)
Persist one document node and its chunk nodes with indexes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| document_node | DocumentNode | Prepared document node. | required |
| chunk_nodes | list[ChunkNode] | Prepared chunk nodes associated with the document. | required |
extract_kg_per_chunk
async
extract_kg_per_chunk(chunks, *, show_progress=False)
Extract knowledge graphs for chunks with bounded concurrency.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunks | list[Chunk] | Chunks to analyze. | required |
| show_progress | bool | When | False |
Returns:
| Type | Description |
|---|---|
| dict[str, KnowledgeGraph] | Extracted graphs keyed by chunk identifier. |
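Bounded concurrency of this kind is commonly built on asyncio.Semaphore. The sketch below shows that general pattern and is not taken from the library's source: extract_all, fake_extract, and the stats dictionary are hypothetical names, and the fake extractor only simulates latency.

```python
import asyncio


async def extract_all(chunk_ids, extract_one, max_workers=4):
    """Run extract_one for every chunk id with at most max_workers in flight."""
    semaphore = asyncio.Semaphore(max_workers)

    async def guarded(chunk_id):
        # Wait for a free slot before starting the extraction call.
        async with semaphore:
            return chunk_id, await extract_one(chunk_id)

    pairs = await asyncio.gather(*(guarded(cid) for cid in chunk_ids))
    return dict(pairs)


# Demo: a fake extractor that records the peak number of concurrent calls.
stats = {"in_flight": 0, "peak": 0}


async def fake_extract(chunk_id):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # stand-in for a model call
    stats["in_flight"] -= 1
    return {"entities": [f"entity-from-{chunk_id}"]}


graphs = asyncio.run(
    extract_all([f"c{i}" for i in range(10)], fake_extract, max_workers=4)
)
```

The result is keyed by chunk identifier, matching the documented return shape, and stats["peak"] never exceeds max_workers.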
persist_entities_and_relationships
async
persist_entities_and_relationships(owner_ids, owner_graphs)
Persist extracted entities and relationships.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| owner_ids | Sequence[str] | Node identifiers that own the extracted graphs. | required |
| owner_graphs | dict[str, KnowledgeGraph] | Extracted graphs keyed by owner identifier. | required |
find_similar_entities
async
find_similar_entities(entity, *, limit=10, threshold=None, candidates=None)
Return candidate entities similar to entity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| entity | Node | Source entity used as the similarity query. | required |
| limit | int | Maximum number of candidate hits to return. | 10 |
| threshold | float \| None | Optional strategy-specific minimum score. | None |
| candidates | list[Node] \| None | Optional candidate pool. When omitted, persisted entities are loaded from the graph database. | None |
Returns:
| Type | Description |
|---|---|
| list[NodeHit] | Ranked similarity candidates. |
Notes
The configured grawiki.similarity.similarity_finder.EntitySimilarityFinder decides which concrete matcher implementation is used.
find_entity_collision_candidates
async
find_entity_collision_candidates(*, limit=10, threshold=None)
Return semantic-key collision groups annotated with merge candidates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| limit | int | Maximum number of candidate hits returned per source entity. | 10 |
| threshold | float \| None | Optional strategy-specific minimum score. | None |
Returns:
| Type | Description |
|---|---|
| list[SemanticKeyCollisionCandidates] | Collision groups with per-entity candidate matches. |
Notes
Candidate generation uses the similarity matcher configured on the injected entity similarity finder.
find_entity_duplicate_candidates
async
find_entity_duplicate_candidates(*, limit=10, threshold=None, skip_semantic_key_collisions_in_similarity_scan=True)
Run the two-step duplicate-finding heuristic across entities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| limit | int | Maximum number of candidate hits returned per source entity. | 10 |
| threshold | float \| None | Optional matcher-specific minimum score. | None |
| skip_semantic_key_collisions_in_similarity_scan | bool | Whether the broader similarity scan should exclude entities already involved in exact semantic-key collisions. | True |
Returns:
| Type | Description |
|---|---|
| EntityDuplicateCandidates | Combined duplicate-candidate report produced by the injected entity similarity finder. |
dedupe_entities
async
dedupe_entities(*, limit=10, threshold=None, min_merge_score=0.95, dry_run=False)
Find duplicate entities and merge them into canonical masters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| limit | int | Maximum candidate hits returned per source entity during duplicate inspection. | 10 |
| threshold | float \| None | Optional similarity threshold forwarded to the duplicate finder. | None |
| min_merge_score | float | Minimum candidate score required for inclusion in a merge group. | 0.95 |
| dry_run | bool | When | False |
Returns:
| Type | Description |
|---|---|
| list[MergeReport] | Reports describing the merge decisions that were made. |
ingest
async
ingest(path, *, show_progress=False)
Run the full ingestion flow for one file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Source file to ingest. | required |
| show_progress | bool | When | False |
Returns:
| Type | Description |
|---|---|
| None | This method persists the resulting graph side effects to the configured database. |
ingest_text
async
ingest_text(text, title, *, format='text', metadata=None, show_progress=False)
Ingest a document supplied as a string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Document content to ingest. | required |
| title | str | Human-readable document title used as the document name. | required |
| format | ('text', 'markdown') | Explicit content format for the in-memory document. Defaults to | "text" |
| metadata | dict[str, str] \| None | Additional metadata attached to the transient source document before persistence. | None |
| show_progress | bool | When | False |
Returns:
| Type | Description |
|---|---|
| None | This method persists the resulting graph side effects to the configured database. |
remember
async
remember(memory, *, memory_id=None, name=None, semantic_key=None, metadata=None, related_node_ids=())
Persist one memory, replacing an existing memory when requested.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| memory | MemoryNode \| str | Memory payload to persist. Raw strings are normalized into a new :class: | required |
| memory_id | str \| None | Existing memory identifier to replace. When omitted, | None |
| name | str \| None | Optional memory name override. Primarily useful when | None |
| semantic_key | str \| None | Optional semantic key override. Defaults to the final memory id. | None |
| metadata | dict[str, str] \| None | Optional metadata merged into the memory metadata. | None |
| related_node_ids | Sequence[str] | Existing node ids that should be explicitly linked from the memory. | () |
Returns:
| Type | Description |
|---|---|
| MemoryNode | Persisted memory payload including its final id. |
search
async
search(query, *, limit=10)
Aggregate results from the configured retrievers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | Raw user query text. | required |
| limit | int | Maximum number of final hits returned after combining retriever outputs. | 10 |
Returns:
| Type | Description |
|---|---|
| list[NodeHit] | Flat, deduplicated search hits across the configured retrievers. With the default retriever set this typically includes chunk, memory, and keyword-expanded entity results. |
Raises:
| Type | Description |
|---|---|
| RuntimeError | Raised when every configured retriever fails for the query. |
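The documented contract (flat, deduplicated hits, with RuntimeError only when every retriever fails) could be met by an aggregation step like the one sketched here. Everything in it is illustrative: retrievers are modeled as async callables returning (node_id, score) pairs, and keeping the highest score per node and ranking by score are assumptions, not the library's confirmed merge policy.

```python
import asyncio


async def aggregate(retrievers, query, limit=10):
    """Merge (node_id, score) hits from several retrievers into one ranked list."""
    results = await asyncio.gather(
        *(retrieve(query) for retrieve in retrievers), return_exceptions=True
    )
    successes = [r for r in results if not isinstance(r, BaseException)]
    if retrievers and not successes:
        raise RuntimeError(f"all retrievers failed for query {query!r}")

    best = {}
    for hits in successes:
        for node_id, score in hits:
            # Assumption: duplicates collapse to their highest score.
            if node_id not in best or score > best[node_id]:
                best[node_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


async def chunk_retriever(query):
    return [("chunk-1", 0.9), ("entity-7", 0.4)]


async def entity_retriever(query):
    return [("entity-7", 0.8)]


async def broken_retriever(query):
    raise ConnectionError("backend unavailable")


hits = asyncio.run(
    aggregate([chunk_retriever, entity_retriever, broken_retriever], "graph rag")
)
```

One failing retriever does not abort the search in this sketch; the error surfaces only when no retriever produced a result.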
recall
async
recall(query, *, user_id=None, limit=5, hops=1, limit_per_hop=5)
Search memories and attach connected graph context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | Raw user query text. | required |
| user_id | str \| None | Optional memory-owner filter applied after memory retrieval. | None |
| limit | int | Maximum number of memories returned. | 5 |
| hops | int | Number of graph-expansion hops to include. | 1 |
| limit_per_hop | int | Maximum recall paths expanded per memory seed. | 5 |
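To give a feel for how hops and limit_per_hop interact, the sketch below expands one memory seed over a plain adjacency dictionary. It is one possible reading of the parameters, not the library's traversal: expand is a hypothetical function, the graph is a toy dict, and capping the number of newly reached nodes per hop is an assumption about what limit_per_hop bounds.

```python
def expand(graph, seed, hops=1, limit_per_hop=5):
    """Collect nodes reachable from seed within `hops`, capped per hop."""
    seen = {seed}
    frontier = [seed]
    context = []
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor in seen:
                    continue
                if len(next_frontier) >= limit_per_hop:
                    break  # assumption: the cap applies to each hop of one seed
                seen.add(neighbor)
                next_frontier.append(neighbor)
        context.extend(next_frontier)
        frontier = next_frontier
    return context


# Toy graph: one memory linked to three nodes, two of which link further.
toy_graph = {"m1": ["a", "b", "c"], "a": ["d"], "b": ["e"]}
one_hop = expand(toy_graph, "m1", hops=1, limit_per_hop=2)
two_hops = expand(toy_graph, "m1", hops=2, limit_per_hop=5)
```

Raising hops widens the context attached to each memory, while limit_per_hop keeps that growth bounded for densely connected seeds.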