Extraction

This page covers the advanced extraction layer that turns raw text into a graph-shaped intermediate representation before persistence. Most users should use extraction through GraphRAG rather than constructing these pieces directly.

For the public ingestion flow and stepwise examples, start with Flows and How to ingest a document.

Structured output with Instructor

KnowledgeGraphExtractor relies on Instructor for structured LLM output. When extract(...) is called, the chunk text is sent to the configured chat model together with a system prompt that defines the desired node and relationship schema. Instructor asks the model to return JSON matching the ExtractedKnowledgeGraph Pydantic model and validates the response against that model, so schema violations are reported at extraction time rather than later in the pipeline. This removes the need for manual JSON parsing or ad-hoc regex extraction.

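from grawiki.graph.extraction import KnowledgeGraphExtractor

# `embedder` is assumed to be an Embedding client created elsewhere in the pipeline.
extractor = KnowledgeGraphExtractor(
    model="openai:gpt-4.1-mini",
    embedding=embedder,
    output_language="Polish",  # omit to get the default, "English"
)
# extract() is a coroutine, so this line must run inside an async function.
graph = await extractor.extract("Alan Turing was a pioneering computer scientist.")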

The resulting graph is a KnowledgeGraph whose nodes already carry embeddings and durable UUIDs, ready for persistence.

When output_language is omitted, KnowledgeGraphExtractor defaults to English for extracted entity names, relationship labels, and textual properties.
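
The validation step described in this section can be illustrated without calling a model. The following is a minimal, stdlib-only sketch of checking a JSON response against a fixed schema. It is a stand-in for what Instructor and Pydantic do, not their implementation, and the field names (`nodes`, `relationships`, `name`, `label`, `source`, `target`, `type`) are assumptions rather than the actual ExtractedKnowledgeGraph schema.

```python
import json


class SchemaError(ValueError):
    """Raised when a model response does not match the expected shape."""


# Hypothetical required keys; the real ExtractedKnowledgeGraph schema may differ.
NODE_KEYS = {"name", "label"}
REL_KEYS = {"source", "target", "type"}


def validate_graph_json(raw: str) -> dict:
    """Parse a JSON response and check it against the assumed schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("top-level value must be an object")
    for key, required in (("nodes", NODE_KEYS), ("relationships", REL_KEYS)):
        items = data.get(key)
        if not isinstance(items, list):
            raise SchemaError(f"'{key}' must be a list")
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                raise SchemaError(f"{key}[{i}] must be an object")
            missing = required - set(item)
            if missing:
                raise SchemaError(f"{key}[{i}] is missing {sorted(missing)}")
    return data
```

A response that omits a required key fails immediately with a message naming the offending item, which is the behaviour that replaces hand-written parsing.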

grawiki.graph.extraction

Knowledge graph extraction helpers.

This module also defines the LLM-facing transient types (ExtractedNode, ExtractedRelationship, ExtractedKnowledgeGraph). They live here rather than in grawiki.graph.models because they are an implementation detail of extraction: the persisted domain model (Node / Relationship / KnowledgeGraph) does not reference them.

KnowledgeGraphExtractorProtocol

Bases: Protocol

Protocol for chunk-level knowledge graph extractors.

extract async

extract(text)

Extract a graph for one text input.
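
Because the protocol only requires an async extract(text) method, any object with that shape can stand in for the LLM-backed extractor, for example in tests. The sketch below shows the structural-typing idea with stand-in types; `StubGraph`, `ExtractorProtocol`, and `CapitalizedWordExtractor` are hypothetical names and not part of grawiki.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class StubGraph:
    """Hypothetical stand-in for KnowledgeGraph; not the library's model."""

    nodes: list[str] = field(default_factory=list)


class ExtractorProtocol(Protocol):
    """Mirrors the documented shape: a single async extract(text) method."""

    async def extract(self, text: str) -> StubGraph: ...


class CapitalizedWordExtractor:
    """Toy extractor that treats capitalized words as entity names."""

    async def extract(self, text: str) -> StubGraph:
        words = [w.strip(".,") for w in text.split()]
        return StubGraph(nodes=[w for w in words if w[:1].isupper()])


async def run(extractor: ExtractorProtocol, text: str) -> StubGraph:
    # Any object with a matching async extract() is accepted structurally;
    # no inheritance from the protocol class is needed.
    return await extractor.extract(text)


graph = asyncio.run(
    run(CapitalizedWordExtractor(), "Alan Turing was a pioneering computer scientist.")
)
```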

ExtractedNode

Bases: GraphModel

Extractor-facing node without a machine-generated identifier.

This transient shape is produced by the LLM extractor before the application assigns durable UUIDs and converts the result into persisted grawiki.graph.models.Node objects.

ExtractedRelationship

Bases: GraphModel

Extractor-facing relationship using node names as endpoints.

Relationship endpoints reference extracted node names within one extraction result and are later rewritten to durable node identifiers during persistence.

ExtractedKnowledgeGraph

Bases: GraphModel

Extractor-facing graph before machine identifiers are assigned.

Node names act as temporary reference keys within one extraction result and are later promoted into a persisted grawiki.graph.models.KnowledgeGraph.
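
The promotion from name-keyed transient types to UUID-keyed persisted types can be sketched as follows. This is an illustration under assumptions, not the library's code: the `Toy*` classes, their fields, and the `promote` helper are all hypothetical.

```python
import uuid
from dataclasses import dataclass


@dataclass
class ToyExtractedNode:
    """Hypothetical transient node: identified only by its name."""

    name: str
    label: str


@dataclass
class ToyExtractedRelationship:
    """Hypothetical transient relationship: endpoints are node names."""

    source: str
    target: str
    type: str


@dataclass
class ToyNode:
    """Hypothetical persisted node with a durable identifier."""

    id: uuid.UUID
    name: str
    label: str


@dataclass
class ToyRelationship:
    """Hypothetical persisted relationship: endpoints are node UUIDs."""

    source_id: uuid.UUID
    target_id: uuid.UUID
    type: str


def promote(nodes, relationships):
    """Assign a UUID to each node, then rewrite name endpoints to those UUIDs."""
    persisted = [ToyNode(uuid.uuid4(), n.name, n.label) for n in nodes]
    id_by_name = {n.name: n.id for n in persisted}
    # In this sketch an endpoint with no matching node raises KeyError.
    rels = [
        ToyRelationship(id_by_name[r.source], id_by_name[r.target], r.type)
        for r in relationships
    ]
    return persisted, rels
```

Names are only meaningful inside one extraction result, so the name-to-UUID mapping is built per call and discarded afterwards.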

KnowledgeGraphExtractor

Extract chunk-level knowledge graphs and attach entity embeddings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | Chat model used for structured knowledge extraction. Passed to instructor.from_provider to create the structured-output client. | required |
| embedding | Embedding | Embedding client used for entity node vectors. Injected so callers share one embedding model across the pipeline instead of each component constructing its own. | required |
| prompt | str | Extraction prompt template. | KG_EXTRACTION_PROMPT |
| max_triplets | int | Maximum number of triplets requested from the model. | 5 |
| output_language | str | Language used for extracted node names, relationship labels, and textual properties. Defaults to "English". | 'English' |
| allowed_entity_types | list[str] \| None | Optional entity label allow-list. | None |
| allowed_relation_types | list[str] \| None | Optional relationship label allow-list. | None |
| fix_missing_nodes | bool | Whether to inject placeholder nodes for relationships that reference missing node names. | True |
| extract_kwargs | dict[str, Any] \| None | Extra keyword arguments forwarded to the instructor create call. Useful for provider-specific options such as reasoning_effort or max_retries. | None |
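
The idea behind fix_missing_nodes can be sketched with plain dictionaries. This is an illustration only: the dictionary keys, the `fix_missing` helper, and the "Unknown" placeholder label are assumptions, and the library's actual placeholder shape is not documented here.

```python
def fix_missing(nodes, relationships, placeholder_label="Unknown"):
    """Return the nodes plus a placeholder for each endpoint name without a node.

    `nodes` are dicts with "name" and "label"; `relationships` are dicts with
    "source", "target", and "type". All of these keys are hypothetical.
    """
    known = {n["name"] for n in nodes}
    fixed = list(nodes)  # copy so the caller's list is left untouched
    for rel in relationships:
        for endpoint in (rel["source"], rel["target"]):
            if endpoint not in known:
                fixed.append({"name": endpoint, "label": placeholder_label})
                known.add(endpoint)  # avoid adding the same placeholder twice
    return fixed
```

Without a step like this, a relationship whose endpoint the model named but never emitted as a node would have nothing to resolve to when names are rewritten to identifiers.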

extractor_client property

extractor_client

Lazy instructor client initialized on first use.
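
Lazy initialization of this kind is commonly written with functools.cached_property. The sketch below shows the pattern only; `LazyClientHolder` and its `factory` argument are hypothetical, and the library may implement the property differently (for example, with a private attribute and a plain property).

```python
from functools import cached_property


class LazyClientHolder:
    """Hypothetical holder that builds its client on first access."""

    def __init__(self, model: str, factory):
        self.model = model
        # `factory` is an assumed callable that turns a model name into a client.
        self._factory = factory

    @cached_property
    def extractor_client(self):
        # Runs once on first access; later accesses reuse the stored result.
        return self._factory(self.model)
```

Deferring construction means that creating the extractor does not require provider credentials or network access until the first extraction actually runs.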

extract async

extract(text)

Extract a knowledge graph for one text input.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Source text to analyze. | required |

Returns:

| Type | Description |
| --- | --- |
| KnowledgeGraph | Extracted graph with embedded entity nodes. |