Ingest a document
This guide shows the stepwise ingestion workflow exposed by GraphRAG. Use it when you want to inspect intermediate artifacts such as chunks, embeddings, document nodes, or per-chunk extraction results.
If you only need the full pipeline, use await rag.ingest(path) instead.
Format detection happens once, when a file is read or when text is passed in:
- .md and .markdown files are loaded as markdown content.
- .pdf files are converted to markdown in memory with pymupdf4llm and then follow the same markdown chunking path.
- ingest_text(...) defaults to plain text, but you can explicitly set format="markdown" for in-memory markdown.
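The rules above can be pictured as a small extension lookup. This is an illustrative stand-in, not GraphRAG's actual implementation: the function name detect_formats, the "markdown" and "text" source-format labels, the case-insensitive match, and the plain-text fallback for unlisted extensions are all assumptions.

```python
from pathlib import Path

# Hypothetical mapping that mirrors the rules above:
# file extension -> (source_format, content_format).
_FORMATS = {
    ".md": ("markdown", "markdown"),
    ".markdown": ("markdown", "markdown"),
    ".pdf": ("pdf", "markdown"),  # PDFs are converted to markdown in memory
}


def detect_formats(path: Path) -> tuple[str, str]:
    """Return (source_format, content_format) for a file path.

    Assumption: any extension not listed above is treated as plain text.
    """
    return _FORMATS.get(path.suffix.lower(), ("text", "text"))


print(detect_formats(Path("path/to/paper.pdf")))
```

Because the lookup runs once per file, later steps only need to consult the recorded content format rather than re-inspect the path.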
Prepare the database and facade
from pathlib import Path

from grawiki.db import FalkorGraphDB
from grawiki.doc_processing.chunkers import build_default_markdown_pipeline
from grawiki.doc_processing.chunk_processors import HypotheticalQuestionsChunkProcessor
from grawiki.rag import GraphRAG

# Graph database that backs the facade.
database = FalkorGraphDB(
    "my_graph",
    db_path="/tmp/my_graph.db",
)

# Facade that exposes both the stepwise methods and the one-shot ingest().
rag = GraphRAG(
    model="openai:gpt-4.1-mini",
    embedding_model="openai:text-embedding-3-small",
    db=database,
    markdown_pipeline=build_default_markdown_pipeline(),  # optional
    chunk_processors=[
        HypotheticalQuestionsChunkProcessor(
            model="openai:gpt-4.1-mini",
            num_question=3,
        )
    ],
)

# Top-level await needs an async context, such as a notebook or an async function.
await database.setup()

source_path = Path("path/to/document.md")
markdown_pipeline= is optional. When you omit it, markdown and PDF-derived markdown fall back to the generic text Chunker. Pass build_default_markdown_pipeline() or any other Pipeline only when you want markdown-aware text/code/table chunk preservation.
By default, KG extraction emits English entity names and relationship labels. Set kg_output_language="Polish" or another language on GraphRAG(...) when you want the extracted graph strings localized.
Run the ingestion steps explicitly
document = rag.read_document(source_path)
chunks = rag.chunk_document(document)
chunks = await rag.process_chunks(chunks)
document_embedding = await rag.embed_document(document)
chunk_embeddings = await rag.embed_chunks(chunks)
document_node = rag.build_document_node(document, document_embedding)
chunk_nodes = rag.build_chunk_nodes(chunks, chunk_embeddings)
await rag.persist_document_and_chunks(document_node, chunk_nodes)
chunk_graphs = await rag.extract_kg_per_chunk(chunks)
await rag.persist_entities_and_relationships(
    [chunk.id for chunk in chunks],
    chunk_graphs,
)
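The final call passes the chunk IDs and the extracted graphs as two parallel lists, so a quick alignment check before persisting can catch a mismatch early. The sketch below is not part of GraphRAG: pair_by_chunk_id is a hypothetical helper, and it assumes the per-chunk results come back in the same order as the chunks.

```python
from typing import Any, Hashable, Sequence


def pair_by_chunk_id(
    chunk_ids: Sequence[Hashable],
    per_chunk_items: Sequence[Any],
) -> dict[Hashable, Any]:
    """Map each chunk ID to its per-chunk artifact, failing loudly on a mismatch."""
    if len(chunk_ids) != len(per_chunk_items):
        raise ValueError(
            f"expected {len(chunk_ids)} items, got {len(per_chunk_items)}"
        )
    return dict(zip(chunk_ids, per_chunk_items))


# Stand-in values; in the workflow above these would be
# [chunk.id for chunk in chunks] and chunk_graphs.
graphs_by_chunk = pair_by_chunk_id(["c1", "c2"], ["graph-1", "graph-2"])
print(graphs_by_chunk["c2"])  # graph-2
```

The same helper could be reused for chunk embeddings if you want to look up an intermediate artifact by chunk ID while debugging.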
What each step does
- read_document(...) loads the source file into a Document model and records filepath, source_format, and content_format metadata.
- chunk_document(...) splits the document into Chunk models. Plain text always uses the generic text chunker. Markdown content uses that same generic chunker unless you explicitly pass markdown_pipeline= on GraphRAG(...); with a configured markdown pipeline, ordered text, code, and table chunks are preserved.
- process_chunks(...) applies any configured chunk_processors= in order. Use this step to keep the explicit workflow aligned with one-shot ingestion.
- embed_document(...) returns an empty list during normal ingestion.
- embed_chunks(...) creates the retrieval vectors used for chunk search and indexing.
- build_document_node(...) and build_chunk_nodes(...) convert the transient models into persisted graph-node shapes. Document nodes are written without vector embeddings by default.
- persist_document_and_chunks(...) writes the document and chunk nodes before extraction.
- extract_kg_per_chunk(...) produces one extracted graph per processed chunk.
- persist_entities_and_relationships(...) writes the extracted entities and relationships back to the graph.
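To make the "in order" behavior of process_chunks(...) concrete, here is a minimal sketch of applying asynchronous processors one after another. It is not the library's implementation: apply_processors, the plain-string chunks, and the two toy processors are hypothetical stand-ins for real Chunk models and chunk processors.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical processor type: takes a list of chunks, returns a new list.
Processor = Callable[[list[str]], Awaitable[list[str]]]


async def apply_processors(
    chunks: list[str],
    processors: list[Processor],
) -> list[str]:
    """Feed the output of each processor into the next one, in list order."""
    for processor in processors:
        chunks = await processor(chunks)
    return chunks


async def strip_whitespace(chunks: list[str]) -> list[str]:
    return [chunk.strip() for chunk in chunks]


async def drop_empty(chunks: list[str]) -> list[str]:
    return [chunk for chunk in chunks if chunk]


result = asyncio.run(
    apply_processors(["  alpha ", "   ", "beta"], [strip_whitespace, drop_empty])
)
print(result)  # ['alpha', 'beta']
```

Order matters here: running drop_empty first would keep the whitespace-only chunk, because it is only emptied by the later strip step.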
Ingest markdown already in memory
When you already have markdown text in memory, ingest_text(..., format="markdown") follows the same content-format routing. With markdown_pipeline= configured it uses markdown-aware chunking; otherwise it falls back to the generic text chunker:
await rag.ingest_text(
    "# Notes\n\nThis paragraph becomes a text chunk.\n\n```python\nprint(1)\n```",
    title="In-memory markdown",
    format="markdown",
)
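To illustrate what markdown-aware chunking means for a string like the one above, the following sketch splits markdown into ordered text and code chunks by scanning for fence lines. It is a simplified stand-in, not the behavior of build_default_markdown_pipeline(): split_markdown and its (kind, content) tuples are assumptions, and tables, size limits, and nested fences are ignored.

```python
FENCE = "`" * 3  # markdown code fence, built indirectly so it can sit inside this example


def split_markdown(markdown: str) -> list[tuple[str, str]]:
    """Split markdown into ordered ("text" | "code", content) chunks."""
    chunks: list[tuple[str, str]] = []
    buffer: list[str] = []
    in_code = False

    def flush(kind: str) -> None:
        # Emit the buffered lines as one chunk, skipping blank-only buffers.
        content = "\n".join(buffer).strip("\n")
        if content.strip():
            chunks.append((kind, content))
        buffer.clear()

    for line in markdown.splitlines():
        if line.startswith(FENCE):
            # A fence line closes the current chunk and toggles the mode.
            flush("code" if in_code else "text")
            in_code = not in_code
        else:
            buffer.append(line)
    flush("code" if in_code else "text")
    return chunks


sample = (
    "# Notes\n\nThis paragraph becomes a text chunk.\n\n"
    + FENCE + "python\nprint(1)\n" + FENCE
)
print(split_markdown(sample))
```

A generic text chunker, by contrast, would have no notion of the fence and could cut through the middle of the code block.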
Ingest a PDF through the markdown path
PDF input is converted to markdown in memory first, then chunked according to the same markdown-content rules:
pdf_document = rag.read_document(Path("path/to/paper.pdf"))
assert pdf_document.metadata["source_format"] == "pdf"
assert pdf_document.metadata["content_format"] == "markdown"
pdf_chunks = rag.chunk_document(pdf_document)
One-shot alternative
When you do not need intermediate artifacts, use the facade method:
await rag.ingest(source_path)
GraphRAG.ingest(...) runs the same sequence, including process_chunks(...), and calls setup() internally.