Semantica is built around a four-layer modular architecture. Import only what you need: the framework never forces a full stack. Every component is independently swappable, and every layer communicates through clean interfaces with no hidden coupling.

Four-Layer Architecture

[Figure: Semantica four-layer architecture]
Layer 1 (Ingestion) loads data from any source into the pipeline as a unified SourceDocument.

| Source | Module | Notes |
| --- | --- | --- |
| PDF, DOCX, PPTX, HTML, JSON, CSV | ingest.FileIngestor | Supports archives, recursive directory scan |
| Parquet | ingest.ParquetIngestor | PyArrow, Hive-style partitions (v0.5.0) |
| XML | ingest.XMLIngestor | XXE-safe lxml, XSD/DTD validation (v0.5.0) |
| Web pages | ingest.WebIngestor | Configurable depth, link filtering |
| SQL / Snowflake | ingest.DBIngestor / ingest.SnowflakeIngestor | Custom SQL, schema introspection |
| Kafka / streams | ingest.StreamIngestor | Real-time feed ingestion |
| Email | ingest.EmailIngestor | IMAP/SMTP with attachment extraction |
| Repositories | ingest.RepoIngestor | Git repos, code structure |
| MCP | ingest.MCPIngestor | Model Context Protocol sources |
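
To illustrate what a "unified" document shape buys you, here is a minimal sketch. It is not Semantica's implementation: the `Document` dataclass, the two toy loaders, and the `LOADERS` dispatch table are all hypothetical. Only the field names (`text`, `metadata`, `source`) are borrowed from the dict keys used in the Extension Points example on this page.

```python
import csv
import io
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a unified source document."""
    text: str
    source: str
    metadata: dict = field(default_factory=dict)


def load_json(raw: str, source: str) -> list[Document]:
    # One document per top-level JSON record.
    records = json.loads(raw)
    if isinstance(records, dict):
        records = [records]
    return [
        Document(text=json.dumps(r, sort_keys=True), source=source,
                 metadata={"format": "json", "index": i})
        for i, r in enumerate(records)
    ]


def load_csv(raw: str, source: str) -> list[Document]:
    # One document per CSV row, keyed by the header line.
    rows = csv.DictReader(io.StringIO(raw))
    return [
        Document(text=", ".join(f"{k}={v}" for k, v in row.items()),
                 source=source, metadata={"format": "csv", "index": i})
        for i, row in enumerate(rows)
    ]


# Hypothetical dispatch table: file extension -> loader function.
LOADERS = {".json": load_json, ".csv": load_csv}


def ingest(raw: str, source: str) -> list[Document]:
    """Pick a loader by file extension and return uniform Document objects."""
    for ext, loader in LOADERS.items():
        if source.endswith(ext):
            return loader(raw, source)
    raise ValueError(f"no loader for {source!r}")
```

Because every loader returns the same shape, downstream stages never need to know which format a document came from.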

Data Flow

Every pipeline follows the same eight-step linear path from raw source to delivered output:

Ingest → Parse → Normalize → Extract → Build KG → QA → Store → Deliver
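
The linear flow can be pictured as a fixed ordering over stage functions. The sketch below is illustrative only: `STAGE_ORDER`, `run_pipeline`, and the toy stages are hypothetical names, not part of Semantica's API. Skipping stages that are not supplied is likewise an assumption, made to reflect the "import only what you need" idea.

```python
from typing import Any, Callable

# Stage names follow the eight steps listed above.
STAGE_ORDER = ["ingest", "parse", "normalize", "extract",
               "build_kg", "qa", "store", "deliver"]

Stage = Callable[[Any], Any]


def run_pipeline(stages: dict[str, Stage], source: Any) -> Any:
    """Run the supplied stages in the fixed order; stages not supplied are skipped."""
    data = source
    for name in STAGE_ORDER:
        stage = stages.get(name)
        if stage is not None:
            data = stage(data)
    return data


# Toy stages that only exercise the control flow.
stages = {
    "ingest": lambda path: f"  Raw text from {path}  ",
    "normalize": lambda text: text.strip().lower(),
    "extract": lambda text: text.split(),
}
result = run_pipeline(stages, "report.pdf")
```

Note that the dictionary's insertion order does not matter: the order of execution is decided by `STAGE_ORDER` alone.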

Module Map

| Layer | Category | Modules |
| --- | --- | --- |
| Layer 1: Ingestion | Sources | ingest, split |
| Layer 2: Processing | Transform | parse, normalize, semantic_extract, deduplication, conflicts |
| Layer 3: Intelligence | Stores | kg, vector_store, graph_store, triplet_store, embeddings, ontology |
| Layer 4: Application | Delivery | context, reasoning, export, visualization, explorer, pipeline |
| - | Cross-cutting | provenance, change_management, llms, mcp_server, seed, evals, core, utils |

Extension Points

Every layer exposes a registry-based extension point. Register custom implementations and they participate in the full pipeline with zero changes to core code.

```python
from semantica.ingest.registry import method_registry

def custom_file_ingestor(source):
    # Return a list of document dicts with 'text', 'metadata', 'source'
    return [{"text": "...", "metadata": {}, "source": source}]

# Register under the "file" task category with a unique name
method_registry.register("file", "my_custom_format", custom_file_ingestor)

# List every method registered under the "file" task category
available = method_registry.list_all("file")
```
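
For readers curious how a registry of this kind can work, a minimal sketch follows. It is not Semantica's implementation. The class name `MethodRegistry`, the `get` method, and the duplicate-name check are assumptions; only `register` and `list_all` mirror the calls shown above.

```python
from collections import defaultdict
from typing import Callable


class MethodRegistry:
    """Hypothetical two-level registry: task category -> method name -> callable."""

    def __init__(self) -> None:
        self._methods: dict[str, dict[str, Callable]] = defaultdict(dict)

    def register(self, task: str, name: str, func: Callable) -> None:
        # Assumed behavior: names must be unique within a task category.
        if name in self._methods[task]:
            raise ValueError(f"{name!r} already registered for task {task!r}")
        self._methods[task][name] = func

    def get(self, task: str, name: str) -> Callable:
        # Raises KeyError if the name was never registered for this task.
        return self._methods[task][name]

    def list_all(self, task: str) -> list[str]:
        # Sorted names; an unknown task yields an empty list.
        return sorted(self._methods.get(task, {}))


registry = MethodRegistry()
registry.register(
    "file",
    "my_custom_format",
    lambda source: [{"text": "...", "metadata": {}, "source": source}],
)
```

The core pipeline only needs to look callables up by category and name, which is why new implementations can be added without touching core code.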

Design Decisions

Every component works standalone. NERExtractor runs without a graph store. VectorStore runs without decision tracking. The framework never forces a full stack instantiation: you pay only for what you import.
Custom ingestors, extractors, validators, and exporters follow the same base class pattern. Register them via PluginRegistry and they participate in the full pipeline, including provenance tracking, retry policies, and parallel execution, with no changes to core code.
Lineage tracking is built into graph construction at the lowest level. Every node and edge carries a source_id pointing back to the originating document, extraction method, and timestamp. There’s no opt-in required: provenance is always on.
Configuration is centralized in ConfigManager, with environment variable overrides. There are no magic defaults: all behavior is explicit and overridable. This suits multi-environment deployments where dev, staging, and production need different backends.
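
The environment-override pattern can be sketched in a few lines. This is a generic illustration, not ConfigManager's actual interface: the `Settings` fields, the `APP_` prefix, and `load_settings` are hypothetical names chosen for the example.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class Settings:
    # Hypothetical settings; names and defaults are illustrative only.
    graph_backend: str = "networkx"
    workers: int = 1


def load_settings(prefix: str = "APP_", environ=None) -> Settings:
    """Build Settings, letting PREFIX + FIELD_NAME variables override defaults."""
    env = os.environ if environ is None else environ
    overrides = {}
    for f in fields(Settings):
        raw = env.get(prefix + f.name.upper())
        if raw is not None:
            # Coerce the environment string to the type of the field's default.
            overrides[f.name] = type(f.default)(raw)
    return Settings(**overrides)


# Simulate a production environment without touching the real os.environ.
prod = load_settings(environ={"APP_GRAPH_BACKEND": "neo4j", "APP_WORKERS": "4"})
```

Passing the environment as a parameter keeps the function easy to test and makes each deployment's overrides explicit.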

Performance Characteristics

| Characteristic | Mechanism |
| --- | --- |
| Parallel execution | Pipeline(workers=N) with configurable workers per stage |
| Delta processing | Incremental graph updates; no full recompute on new data |
| Streaming ingestion | Process large corpora without loading everything into memory |
| Backend flexibility | Swap in-memory NetworkX for Neo4j / FalkorDB with no API changes |
| Deduplication v2 | blocking_v2, hybrid_v2, semantic_v2; up to 7x faster than v1 |
| Indexed search | Explorer search at 0.004ms on 118k nodes (v0.5.0) |
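
As a rough picture of what per-stage workers mean, the sketch below fans one stage out over a thread pool using only the standard library. It does not show how Semantica's Pipeline is implemented; `run_stage` and its parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def run_stage(func: Callable, items: Iterable, workers: int = 4) -> Iterator:
    """Apply func to every item using a pool of worker threads.

    Results are yielded in input order, even if later items finish first.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(func, items)


# Toy stage: measure the length of each document.
lengths = list(run_stage(len, ["alpha", "be", "gamma"], workers=2))
```

Threads suit I/O-bound stages such as fetching or parsing files. Note that `pool.map` submits all items up front, so this sketch parallelizes work but does not by itself bound memory for very large inputs.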

Modules

Full module documentation with code examples.

Learning More

Configuration reference, performance guide, and troubleshooting.

Pipeline Reference

Pipeline orchestration, workers, and retry policies.

Core Reference

Framework lifecycle, plugin registry, and configuration.