semantica.semantic_extract turns unstructured text into structured graph-ready output: it identifies named entities, extracts relationships between them, detects time-anchored events, resolves coreferences, and serialises everything as RDF triplets. Use it to populate a ContextGraph from raw documents — intelligence reports, clinical notes, regulatory filings, or any free-text corpus.
Extracted entities and relationships feed into ContextGraph via AgentContext.store(). For how they are attributed back to source documents, see the Provenance Guide. For how the populated graph is queried and traversed, see Context Graphs.
Step 1 — Named Entity Recognition: who and what is in the text
NamedEntityRecognizer extracts meaningful nouns from a document and lets you choose the underlying method depending on your latency budget and domain requirements:
The confidence field tells you how certain the extractor is. Entities whose confidence falls below your threshold (here 0.75) are filtered out before they reach you. The label field uses either standard CoNLL types (PERSON, ORG, GPE) or domain-specific ones the LLM infers from context (THREAT_ACTOR, MALWARE, CVE, NETWORK).
Once you have the flat list of entities, group them by type to make the next steps easier:
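The grouping step can be sketched in plain Python. The `Entity` dataclass below is a hypothetical stand-in, not semantica's own class: only the `label` and `confidence` field names come from this guide, while `text`, the helper name `group_by_label`, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for an extracted entity (not semantica's class)."""
    text: str          # assumed field name for the surface form
    label: str         # entity type, e.g. THREAT_ACTOR
    confidence: float  # extractor certainty in [0, 1]


def group_by_label(entities, min_confidence=0.75):
    """Drop low-confidence entities, then bucket the rest by label."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent.confidence >= min_confidence:
            grouped[ent.label].append(ent.text)
    return dict(grouped)


# Illustrative sample values only.
entities = [
    Entity("GAMMA-7", "THREAT_ACTOR", 0.96),
    Entity("HAMMERTOSS", "MALWARE", 0.91),
    Entity("CVE-2021-44228", "CVE", 0.99),
    Entity("the network", "NETWORK", 0.52),  # below threshold, dropped
]
by_type = group_by_label(entities)
print(by_type)
```

The threshold check mirrors the 0.75 cut-off described above and only matters for lists that were not already filtered upstream.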
Step 2 — Relation Extraction: how the entities connect
RelationExtractor produces the web of connections between entities — who deployed what, who supplied whom, which CVE targets which product:
The context field on each Relation stores the surrounding sentence. This lets you audit why the extractor made a given connection, which is essential when analysts need to verify that a link is grounded in the source text before acting on it.
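One simple way to use the stored context is a literal grounding check: flag any relation whose endpoints do not both appear in its context sentence. This is a sketch under stated assumptions. `Relation` here is a hypothetical stand-in, and only the `context` field name comes from this guide; `subject`, `predicate`, `object`, and `is_grounded` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    """Hypothetical stand-in; only `context` is named in the guide."""
    subject: str
    predicate: str
    object: str
    context: str  # sentence the relation was extracted from


def is_grounded(rel):
    """True when both endpoints literally appear in the stored context."""
    sentence = rel.context.lower()
    return rel.subject.lower() in sentence and rel.object.lower() in sentence


rels = [
    Relation("GAMMA-7", "deployed", "HAMMERTOSS",
             "GAMMA-7 deployed HAMMERTOSS against the ministry network."),
    Relation("GAMMA-7", "deployed", "HAMMERTOSS",
             "The group later switched tooling."),
]
# Relations that fail the literal check go to an analyst for review.
needs_review = [r for r in rels if not is_grounded(r)]
print(len(needs_review))
```

A literal match is deliberately strict: it will also flag links expressed through aliases such as "the group", so it works best as a triage filter rather than a verdict.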
Step 3 — Event Detection: what happened, when, and to whom
EventDetector surfaces structured time-anchored events — discrete occurrences with participants, time windows, and locations:
Pass documents to extract() as a list of dicts. Each dict carries a content key and an optional id for provenance tracking:
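The document shape described here can be built with a small helper. This is a minimal sketch: the `content` and `id` keys come from this guide, while the helper name `to_batch` and the sample identifiers are hypothetical.

```python
from typing import Iterable, List, Optional


def to_batch(texts: Iterable[str], ids: Optional[Iterable[str]] = None) -> List[dict]:
    """Wrap raw strings as dicts with `content` and, when given, `id`."""
    texts = list(texts)
    if ids is None:
        return [{"content": t} for t in texts]
    ids = list(ids)
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    return [{"content": t, "id": i} for t, i in zip(texts, ids)]


# Hypothetical document ids, used only to show the resulting shape.
batch = to_batch(
    ["GAMMA-7 deployed HAMMERTOSS.", "The group later switched tooling."],
    ids=["report-001", "report-002"],
)
print(batch[0])
```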
Step 4 — Coreference Resolution: one entity, many names
CoreferenceResolver collapses references like “GAMMA-7”, “the group”, “they”, and “the threat actor” into canonical chains so downstream extraction doesn’t treat them as separate entities:
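To make the idea of canonical chains concrete, the sketch below applies chains that have already been resolved: it rewrites every alias to its canonical name. It does not perform coreference resolution itself, which requires deciding from context which mention belongs to which chain. The `chains` dict shape and the `canonicalise` helper are assumptions for illustration, not semantica's output format.

```python
import re


def canonicalise(text, chains):
    """Replace each alias in `text` with the canonical name of its chain."""
    alias_to_canonical = {
        alias.lower(): canonical
        for canonical, aliases in chains.items()
        for alias in aliases
    }
    if not alias_to_canonical:
        return text
    # Longest alias first, so multi-word aliases win over shorter overlaps.
    ordered = sorted(alias_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: alias_to_canonical[m.group(0).lower()], text)


chains = {"GAMMA-7": ["the group", "they", "the threat actor"]}
text = "GAMMA-7 breached the network. They exfiltrated data, and the group vanished."
resolved = canonicalise(text, chains)
print(resolved)
```

Blind substitution like this is only safe once a resolver has confirmed the chain; a bare "they" in another document may refer to someone else entirely.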
Step 5 — Triplet Extraction and RDF Serialisation: graph-ready output
TripletExtractor converts everything into subject-predicate-object triplets and serialises them as RDF, ready for graph ingestion and SPARQL queries:
A triplet such as (GAMMA-7, deployed, HAMMERTOSS) extracted with include_temporal=True carries the time interval from the Event you detected in step 3, keeping the graph queryable not just by what happened but by when.
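This guide does not state how the time interval is encoded in the RDF output. One standard option is RDF reification, sketched below with the standard library only. Everything here is an assumption for illustration: the `Triplet` stand-in, the `http://example.org/` namespace, the `ex:validFrom` and `ex:validUntil` predicates, and the sample dates.

```python
from dataclasses import dataclass
from typing import Optional

EX = "http://example.org/"  # placeholder namespace, an assumption


@dataclass
class Triplet:
    """Hypothetical stand-in for an extracted triplet."""
    subject: str
    predicate: str
    object: str
    start: Optional[str] = None  # ISO 8601 date
    end: Optional[str] = None


def _local(name):
    """Turn a label into a safe Turtle local name."""
    return "".join(ch if ch.isalnum() else "_" for ch in name)


def to_turtle(triplets):
    """Serialise triplets as Turtle, reifying those that carry an interval."""
    lines = [
        f"@prefix ex: <{EX}> .",
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
    ]
    for i, t in enumerate(triplets):
        s, p, o = (f"ex:{_local(x)}" for x in (t.subject, t.predicate, t.object))
        lines.append(f"{s} {p} {o} .")
        if t.start or t.end:
            # Reify the statement so the interval can be attached to it.
            lines.append(f"ex:stmt{i} a rdf:Statement ;")
            lines.append(f"    rdf:subject {s} ;")
            lines.append(f"    rdf:predicate {p} ;")
            parts = [f"    rdf:object {o}"]
            if t.start:
                parts.append(f'    ex:validFrom "{t.start}"^^xsd:date')
            if t.end:
                parts.append(f'    ex:validUntil "{t.end}"^^xsd:date')
            lines.append(" ;\n".join(parts) + " .")
    return "\n".join(lines)


# The dates are illustrative, not taken from any report.
ttl = to_turtle([Triplet("GAMMA-7", "deployed", "HAMMERTOSS",
                         start="2024-03-01", end="2024-03-15")])
print(ttl)
```

Reification keeps the plain triple queryable on its own while letting a SPARQL query filter on the interval through the statement node. RDF-star or named graphs are alternative encodings.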
Putting it together: a reusable extraction pipeline
Chain all five steps into a single function you can call on every incoming document:
Domain examples
- Defense — CTI/Threat
- Security — SOC/Incident
- Life Science — Clinical/Pharma
- Banking — Risk/Compliance
Finished intelligence reports contain threat actors, malware, CVEs, infrastructure clusters, and operation timelines. LLM-backed NER handles custom entity labels (THREAT_ACTOR, OPERATION) that spaCy’s off-the-shelf models miss, while RDF serialisation produces Turtle output compatible with STIX 2.1 object types.
Choosing your extraction method
The six extraction methods trade off speed, accuracy, and infrastructure:
- "pattern" and "regex": no dependencies, under 5 ms, ideal as the last fallback in any method chain. Reliable for narrow, predictable domains like CVE identifiers or IP addresses.
- "rules": linguistic rule-based detection, also offline, under 10 ms.
- "ml" / "spacy": general English NER at 50–200 ms with no API calls. Install with pip install spacy && python -m spacy download en_core_web_sm. The best default for production pipelines where LLM cost is a concern.
- "huggingface": domain-specific fine-tuned models at 200 ms–2 s. Use d4data/biomedical-ner-all for pharma, dslim/bert-base-NER for general high-accuracy NER. Install with pip install "semantica[huggingface]".
- "llm": highest recall for implicit entities and custom label schemas, 1–10 s per document. Always pair with a fallback: methods=["llm", "ml", "pattern"].
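The fallback idea behind a method chain can be illustrated without any of the libraries above. The sketch below pairs an offline regex pass for CVE identifiers and IPv4 addresses with a generic chain runner. It is not semantica's implementation: the function names are hypothetical, and falling through on both errors and empty results is an assumption about how such a chain might behave.

```python
import re

CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)


def pattern_extract(text):
    """Offline regex pass for CVE identifiers and IPv4 addresses."""
    found = [(m.group(0), "CVE") for m in CVE_RE.finditer(text)]
    found += [(m.group(0), "IP_ADDRESS") for m in IPV4_RE.finditer(text)]
    return found


def extract_with_fallback(text, extractors):
    """Try each (name, callable) in order; return the first non-empty result."""
    for name, extractor in extractors:
        try:
            result = extractor(text)
        except Exception:
            continue  # e.g. API timeout or missing model: try the next method
        if result:
            return name, result
    return None, []


def flaky_llm(text):
    """Stand-in for a remote LLM call that is currently unavailable."""
    raise TimeoutError("LLM backend unreachable")


method, entities = extract_with_fallback(
    "Traffic from 203.0.113.7 exploited CVE-2021-44228.",
    [("llm", flaky_llm), ("pattern", pattern_extract)],
)
print(method, entities)
```

Putting the cheapest, dependency-free method last means the chain always has something to return, even when every networked method is down.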
Related Guides
- Provenance Guide — track every extracted entity and chunk back to its source document
- Agent Memory Guide — store extracted knowledge as searchable agent memories with graph enrichment
- Context Graphs Guide — how extracted entities populate ContextGraph nodes and edges
- GraphRAG Guide — retrieve facts from the populated graph to ground LLM responses
- Reasoning Guide — derive new facts, run SPARQL queries, and apply inference rules over the extracted graph
- Semantic Extract Reference — full API for all extractor classes, providers, and validators
