from semantica.semantic_extract import NERExtractor

# Start with pattern method (no setup required)
ner = NERExtractor(method="pattern")
entities = ner.extract("Apple Inc. was founded by Steve Jobs.")
print(f"Found {len(entities)} entities")
# Output: Found 2 entities

# Upgrade to LLM for better accuracy
from semantica.llms import Groq
import os

llm = Groq(api_key=os.getenv("GROQ_API_KEY"))
ner = NERExtractor(method="llm", llm_provider=llm)
entities = ner.extract("Apple Inc. was founded by Steve Jobs.")
NamedEntityRecognizer is the high-level coordinator, adding confidence thresholding and overlap merging. NERExtractor is the lower-level core implementation. For most use cases, start with NERExtractor for simplicity, and switch to NamedEntityRecognizer when you need control over thresholds and merging.
| Class | Role |
|---|---|
| NamedEntityRecognizer | High-level NER with confidence thresholding and overlap merging |
| NERExtractor | Core NER implementation; use directly for simplicity |
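To make "confidence thresholding and overlap merging" concrete, here is a minimal, library-independent sketch. The `Span` record, the `filter_and_merge` helper, and the greedy "keep the most confident span per overlap" rule are all assumptions made for illustration; Semantica's own strategy may differ.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical entity record; not Semantica's own entity class."""
    text: str
    label: str
    start: int
    end: int  # exclusive character offset
    confidence: float


def filter_and_merge(spans, min_confidence=0.5):
    """Drop low-confidence spans, then keep one span per overlapping group."""
    kept = [s for s in spans if s.confidence >= min_confidence]
    # Prefer higher confidence, then longer spans.
    kept.sort(key=lambda s: (-s.confidence, -(s.end - s.start)))
    result = []
    for candidate in kept:
        overlaps = any(
            candidate.start < r.end and r.start < candidate.end for r in result
        )
        if not overlaps:
            result.append(candidate)
    return sorted(result, key=lambda s: s.start)


# Offsets refer to "Apple Inc. was founded by Steve Jobs."
spans = [
    Span("Apple", "ORG", 0, 5, 0.60),
    Span("Apple Inc.", "ORG", 0, 10, 0.92),
    Span("Steve Jobs", "PERSON", 26, 36, 0.95),
    Span("Jobs", "ORG", 32, 36, 0.30),
]
merged = filter_and_merge(spans, min_confidence=0.5)
for s in merged:
    print(s.text, s.label, s.confidence)
```

Here the low-confidence "Jobs" span is filtered out by the threshold, and "Apple" loses to the overlapping, more confident "Apple Inc.".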
No extra setup and no API key required. Uses spaCy rules and regular expressions to match standard entity types.
Setup: None; works out of the box
Cost: Free
Accuracy: Good for standard entity types
Best for: Quick prototyping, batch processing, air-gapped systems
from semantica.semantic_extract import NERExtractor, RelationExtractor

text = "Apple Inc. was founded by Steve Jobs in Cupertino."

# Extract entities first, then the relations between them
ner = NERExtractor(method="pattern")
entities = ner.extract(text)

rel = RelationExtractor(method="pattern")
relationships = rel.extract(text, entities=entities)
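As a rough illustration of what rule-based matching involves, the sketch below uses only Python's standard `re` module. The two patterns and the `extract_with_patterns` helper are hypothetical and far simpler than a real rule set; they are not Semantica's actual rules.

```python
import re

# Illustrative patterns only (assumed for this sketch).
PATTERNS = {
    "ORG": re.compile(r"\b(?:[A-Z][A-Za-z]+ )+(?:Inc\.|Corp\.|Ltd\.)"),
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def extract_with_patterns(text):
    """Return (text, label, start, end) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.group(), label, match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


sample = "Apple Inc. was founded by Steve Jobs in Cupertino in 1976."
matches = extract_with_patterns(sample)
for item in matches:
    print(item)
```

Note what the sketch misses: "Steve Jobs" and "Cupertino" match neither pattern, which is the kind of gap the HuggingFace and LLM methods are meant to close.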
Use any pre-trained or fine-tuned transformer model. Free inference, runs locally.
Setup: pip install semantica[models-huggingface]
Cost: Free (local compute)
Accuracy: Excellent for domain-specific NER
Best for: Medical NER, custom fine-tunes, no API cost
from semantica.semantic_extract import NERExtractor

text = "Apple Inc. was founded by Steve Jobs in Cupertino."

ner = NERExtractor(method="huggingface")

# Pass model per-call
entities = ner.extract(text, model="dslim/bert-base-NER", device="cpu")

# Biomedical NER
entities = ner.extract(text, model="d4data/biomedical-ner-all")
Highest accuracy for complex schemas and custom entity types. Requires an LLM API key.
Fallback chains try methods in priority order, so extraction still returns results when the preferred method is unavailable.
from semantica.semantic_extract import NERExtractor, RelationExtractor

text = "Apple Inc. was founded by Steve Jobs in Cupertino."

# Try LLM first, fall back to pattern on error
ner = NERExtractor(method=["llm", "pattern"])
rel = RelationExtractor(method=["llm", "pattern"])

# Always returns results: safe for production pipelines
entities = ner.extract(text)
relationships = rel.extract(text, entities=entities)
Use fallback chains in pipelines where API availability isn’t guaranteed (rate limits, network issues). Methods are tried in list order, and a later method runs only when the ones before it fail.
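The control flow behind a fallback chain can be sketched without the library. In this minimal example, `extract_with_fallback`, `flaky_llm`, and `simple_pattern` are hypothetical stand-ins, and treating an empty result as a reason to move on is an assumption of the sketch rather than documented Semantica behavior.

```python
def extract_with_fallback(text, extractors):
    """Try (name, callable) pairs in order; return the first non-empty result."""
    errors = []
    for name, extract in extractors:
        try:
            entities = extract(text)
        except Exception as exc:  # e.g. rate limit or network failure
            errors.append((name, repr(exc)))
            continue
        if entities:
            return name, entities, errors
    return None, [], errors


def flaky_llm(text):
    """Stand-in for an LLM call that is currently unavailable."""
    raise ConnectionError("simulated outage")


def simple_pattern(text):
    """Stand-in pattern method: keep capitalized tokens."""
    return [word for word in text.split() if word[:1].isupper()]


used, entities, errors = extract_with_fallback(
    "Apple Inc. was founded by Steve Jobs.",
    [("llm", flaky_llm), ("pattern", simple_pattern)],
)
print(used, entities)
print(errors)
```

Returning the collected errors alongside the result keeps a fallback from hiding the failure of the preferred method, which is worth logging in production pipelines.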
from semantica.semantic_extract import NERExtractor, RelationExtractor, TripletExtractor
from semantica.llms import Groq
import os

text = "Apple Inc. was founded by Steve Jobs in Cupertino in 1976."
llm = Groq(model="llama-3.3-70b-versatile", api_key=os.getenv("GROQ_API_KEY"))

entities = NERExtractor(method="llm", llm_provider=llm).extract(text)
relationships = RelationExtractor(method="llm", llm_provider=llm).extract(text, entities=entities)
triplets = TripletExtractor(method="llm", llm_provider=llm).extract(text)
from semantica.semantic_extract import NERExtractor
from semantica.llms import Groq
import os

text = "Apple Inc. was founded by Steve Jobs in Cupertino."

# Pattern-based: fast, no API key, good for standard entity types
ner = NERExtractor(method="pattern")
entities = ner.extract(text)

# HuggingFace-based: custom models, no API cost
ner = NERExtractor(method="huggingface")
entities = ner.extract(text, model="dslim/bert-base-NER", device="cpu")

# LLM-based: best accuracy, handles complex schemas and custom types
llm = Groq(model="llama-3.3-70b-versatile", api_key=os.getenv("GROQ_API_KEY"))
ner = NERExtractor(method="llm", llm_provider=llm, max_retries=3)
entities = ner.extract(text)
v0.5.0 fix: NERExtractor(method="llm") no longer silently falls back to pattern extraction on custom gateways. The response_format=json_object parameter is now omitted for incompatible gateways, and a fallback that calls plain generate() and parses the JSON from the reply is applied automatically.
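When a gateway cannot enforce JSON output, the reply may wrap the payload in extra prose. The sketch below shows one common way to recover it; `parse_json_reply` is a hypothetical helper written for illustration and is not the parsing code Semantica ships.

```python
import json
import re


def parse_json_reply(reply):
    """Best-effort JSON extraction from a free-form model reply."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass
    # Fall back to the widest {...} or [...] span found in the reply.
    match = re.search(r"(\{.*\}|\[.*\])", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON payload found in reply")
    return json.loads(match.group(1))


reply = (
    "Sure, here is the result:\n"
    '{"entities": [{"text": "Apple Inc.", "label": "ORG"}]}\n'
    "Let me know if you need anything else."
)
parsed = parse_json_reply(reply)
print(parsed["entities"])
```

Raising on an unparseable reply, instead of returning an empty list, keeps the failure visible to the caller.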
Resolve pronoun and alias references to canonical entities before extraction:
from semantica.semantic_extract import CoreferenceResolver

resolver = CoreferenceResolver()
resolved_text = resolver.resolve(
    "Apple Inc. was founded in 1976. The company is headquartered in Cupertino."
)
# "Apple Inc." replaces "The company" for consistent downstream extraction
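The substitution step can be illustrated with a toy example. Real coreference resolution has to discover which mentions refer to which entity; this sketch skips that part and assumes a hand-written alias map, and `resolve_aliases` is a hypothetical helper, not part of Semantica.

```python
import re


def resolve_aliases(text, aliases):
    """Replace each alias phrase with its canonical name (case-sensitive)."""
    for alias, canonical in aliases.items():
        text = re.sub(r"\b" + re.escape(alias) + r"\b", canonical, text)
    return text


source = "Apple Inc. was founded in 1976. The company is headquartered in Cupertino."
resolved = resolve_aliases(source, {"The company": "Apple Inc."})
print(resolved)
```

After substitution both sentences name "Apple Inc." explicitly, so an extractor that works sentence by sentence can attach the headquarters fact to the right entity.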
All extractors automatically detect batch input and process multiple texts efficiently:
from semantica.semantic_extract import NERExtractor
from semantica.llms import Groq
import os

# Batch processing with list input
texts = [
    "Apple Inc. was founded by Steve Jobs.",
    "Google was founded by Larry Page.",
    "Microsoft was founded by Bill Gates.",
]

llm = Groq(model="llama-3.3-70b-versatile", api_key=os.getenv("GROQ_API_KEY"))
ner = NERExtractor(method="llm", llm_provider=llm)
batch_results = ner.extract(texts)  # Returns List[List[Entity]]

# Process results
for i, doc_entities in enumerate(batch_results):
    print(f"Document {i}: {len(doc_entities)} entities")
    for entity in doc_entities:
        print(f"  - {entity.text} ({entity.label})")
Batch Input Options:
# Option 1: List of strings
texts = ["Text 1...", "Text 2...", "Text 3..."]
results = ner.extract(texts)

# Option 2: List of documents with IDs (adds provenance metadata)
documents = [
    {"id": "doc_1", "content": "Apple Inc. was founded by Steve Jobs."},
    {"id": "doc_2", "content": "Google was founded by Larry Page."},
]
results = ner.extract(documents)  # Entities include document_id in metadata
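One way to picture batch detection is as a normalization step that runs before extraction. In the sketch below, `normalize_inputs`, `extract_any`, and `toy_extract` are hypothetical names, and entities are plain dicts for simplicity; the library's internals and entity type may look different.

```python
def normalize_inputs(data):
    """Return (is_batch, [(document_id, text), ...]) for the accepted shapes."""
    if isinstance(data, str):
        return False, [(None, data)]
    items = []
    for item in data:
        if isinstance(item, dict):
            items.append((item.get("id"), item["content"]))
        else:
            items.append((None, item))
    return True, items


def extract_any(data, extract_one):
    """Run extract_one per text and tag entities with their document_id."""
    is_batch, items = normalize_inputs(data)
    results = []
    for doc_id, text in items:
        entities = extract_one(text)
        if doc_id is not None:
            for entity in entities:
                entity.setdefault("metadata", {})["document_id"] = doc_id
        results.append(entities)
    return results if is_batch else results[0]


def toy_extract(text):
    """Stand-in extractor: treat the first word as the only entity."""
    return [{"text": text.split()[0]}]


single = extract_any("Google was founded by Larry Page.", toy_extract)
batch = extract_any(
    [
        {"id": "doc_1", "content": "Apple Inc. was founded by Steve Jobs."},
        {"id": "doc_2", "content": "Google was founded by Larry Page."},
    ],
    toy_extract,
)
print(single)
print(batch)
```

A single string yields a flat list, while any list input yields one result list per document, mirroring the List[List[Entity]] shape described above.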