Four-Layer Architecture
- Layer 1: Ingestion
- Layer 2: Processing
- Layer 3: Intelligence
- Layer 4: Application
Layer 1 (Ingestion) loads data from any source into the pipeline as a unified SourceDocument.

| Source | Module | Notes |
|---|---|---|
| PDF, DOCX, PPTX, HTML, JSON, CSV | ingest.FileIngestor | Supports archives, recursive directory scan |
| Parquet | ingest.ParquetIngestor | PyArrow, Hive-style partitions (v0.5.0) |
| XML | ingest.XMLIngestor | XXE-safe lxml, XSD/DTD validation (v0.5.0) |
| Web pages | ingest.WebIngestor | Configurable depth, link filtering |
| SQL / Snowflake | ingest.DBIngestor / ingest.SnowflakeIngestor | Custom SQL, schema introspection |
| Kafka / streams | ingest.StreamIngestor | Real-time feed ingestion |
| Email | ingest.EmailIngestor | IMAP/SMTP with attachment extraction |
| Repositories | ingest.RepoIngestor | Git repos, code structure |
| MCP | ingest.MCPIngestor | Model Context Protocol sources |
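The idea of a unified record can be sketched as follows. This is a minimal illustration, not the framework's actual classes: the fields on `SourceDocument`, the `ingest(name, raw)` signature, and the two toy ingestors are all assumptions made for the example. The point is only that every source, whatever its format, yields the same record type, so downstream stages never need to branch on where the data came from.

```python
import csv
import io
import json
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class SourceDocument:
    """Hypothetical unified record; the real class may carry different fields."""
    source: str
    content: str
    metadata: dict = field(default_factory=dict)


class JSONIngestor:
    """Toy ingestor (assumed interface): one JSON payload becomes one document."""

    def ingest(self, name: str, raw: str) -> Iterable[SourceDocument]:
        record = json.loads(raw)
        yield SourceDocument(
            source=name,
            content=json.dumps(record, sort_keys=True),
            metadata={"format": "json"},
        )


class CSVIngestor:
    """Toy ingestor (assumed interface): each CSV row becomes one document."""

    def ingest(self, name: str, raw: str) -> Iterable[SourceDocument]:
        reader = csv.DictReader(io.StringIO(raw))
        for row_number, row in enumerate(reader, start=1):
            yield SourceDocument(
                source=name,
                content=json.dumps(row, sort_keys=True),
                metadata={"format": "csv", "row": row_number},
            )


def load_all(inputs: List[Tuple[object, str, str]]) -> List[SourceDocument]:
    """Run each (ingestor, name, raw text) triple and collect one flat list."""
    docs: List[SourceDocument] = []
    for ingestor, name, raw in inputs:
        docs.extend(ingestor.ingest(name, raw))
    return docs


docs = load_all([
    (JSONIngestor(), "a.json", '{"title": "Hello"}'),
    (CSVIngestor(), "b.csv", "name,age\nAda,36\nAlan,41\n"),
])
```

Here `docs` holds three records of the same type: one from the JSON payload and one per CSV row.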
Data Flow
Every pipeline follows the same linear path from raw source to delivered output, passing through the four layers in order: Ingestion, Processing, Intelligence, Application.
Module Map
| Layer | Category | Modules |
|---|---|---|
| Layer 1: Ingestion | Sources | ingest, split |
| Layer 2: Processing | Transform | parse, normalize, semantic_extract, deduplication, conflicts |
| Layer 3: Intelligence | Stores | kg, vector_store, graph_store, triplet_store, embeddings, ontology |
| Layer 4: Application | Delivery | context, reasoning, export, visualization, explorer, pipeline |
| - | Cross-cutting | provenance, change_management, llms, mcp_server, seed, evals, core, utils |
Extension Points
Every layer exposes a registry-based extension point. Register custom implementations and they participate in the full pipeline with zero changes to core code.
Design Decisions
Modularity: use only what you need
Every component works standalone. NERExtractor runs without a graph store. VectorStore runs without decision tracking. The framework never forces a full stack instantiation: you pay only for what you import.
Pluggability: extend without modifying core
Custom ingestors, extractors, validators, and exporters follow the same base class pattern. Register them via PluginRegistry and they participate in the full pipeline, with provenance tracking, retry policies, and parallel execution included, and with no changes to core code.
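The registry pattern described here can be sketched in a few lines. Everything named below (`Registry`, `BaseExtractor`, `register`, `create`, `run_stage`) is a hypothetical stand-in written for this example; the real PluginRegistry API may look quite different. The sketch only shows why core code stays untouched: the pipeline stage looks plugins up by name and never imports them directly.

```python
from typing import Dict, List, Type


class BaseExtractor:
    """Assumed base class: plugins subclass it and override extract()."""
    name = "base"

    def extract(self, text: str) -> List[str]:
        raise NotImplementedError


class Registry:
    """Illustrative stand-in for a plugin registry (not the real API)."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Type[BaseExtractor]] = {}

    def register(self, cls: Type[BaseExtractor]) -> Type[BaseExtractor]:
        """Usable as a class decorator; rejects wrong types and duplicate names."""
        if not issubclass(cls, BaseExtractor):
            raise TypeError(f"{cls.__name__} must subclass BaseExtractor")
        if cls.name in self._plugins:
            raise ValueError(f"plugin {cls.name!r} is already registered")
        self._plugins[cls.name] = cls
        return cls

    def create(self, name: str) -> BaseExtractor:
        return self._plugins[name]()


registry = Registry()


@registry.register
class HashtagExtractor(BaseExtractor):
    """A custom plugin: pulls '#tag' tokens out of whitespace-separated text."""
    name = "hashtag"

    def extract(self, text: str) -> List[str]:
        return [w[1:] for w in text.split() if w.startswith("#") and len(w) > 1]


def run_stage(extractor_name: str, texts: List[str]) -> List[List[str]]:
    """'Core' code: resolves the plugin by name, knows nothing about its class."""
    extractor = registry.create(extractor_name)
    return [extractor.extract(t) for t in texts]


result = run_stage("hashtag", ["ship #v1 today", "no tags"])
```

Adding another extractor means defining one more decorated class; `run_stage` does not change.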
Provenance by default
Lineage tracking is built into graph construction at the lowest level. Every node and edge carries a source_id pointing back to the originating document, extraction method, and timestamp. There’s no opt-in required: provenance is always on.
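One way to picture "always on" provenance is a graph whose write methods refuse to accept a node or edge without a known source_id. The sketch below is an assumption-laden illustration: `ProvenanceGraph`, `ProvenanceRecord`, and their methods are hypothetical names, and only the `source_id` field and the three things it points to (document, extraction method, timestamp) come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """What a source_id resolves to in this sketch."""
    source_id: str
    document: str
    method: str
    timestamp: datetime


class ProvenanceGraph:
    """Hypothetical graph in which every write must name a registered source."""

    def __init__(self) -> None:
        self.records = {}  # source_id -> ProvenanceRecord
        self.nodes = {}    # node_id -> attribute dict (always has "source_id")
        self.edges = []    # list of edge dicts (always have "source_id")

    def add_source(self, document: str, method: str) -> str:
        source_id = f"src-{len(self.records) + 1}"
        self.records[source_id] = ProvenanceRecord(
            source_id, document, method, datetime.now(timezone.utc)
        )
        return source_id

    def _require(self, source_id: str) -> None:
        if source_id not in self.records:
            raise KeyError(f"unknown source_id: {source_id}")

    def add_node(self, node_id: str, source_id: str, **attrs) -> None:
        self._require(source_id)
        self.nodes[node_id] = {"source_id": source_id, **attrs}

    def add_edge(self, head: str, relation: str, tail: str, source_id: str) -> None:
        self._require(source_id)
        self.edges.append(
            {"head": head, "relation": relation, "tail": tail, "source_id": source_id}
        )

    def trace(self, node_id: str) -> ProvenanceRecord:
        """Follow a node's source_id back to its provenance record."""
        return self.records[self.nodes[node_id]["source_id"]]


graph = ProvenanceGraph()
sid = graph.add_source(document="bio.pdf", method="ner")
graph.add_node("ada", sid, label="Person")
graph.add_node("analytical_engine", sid, label="Machine")
graph.add_edge("ada", "wrote_about", "analytical_engine", sid)
origin = graph.trace("ada")
```

Because `source_id` is a required argument rather than an optional attribute, there is no code path in this sketch that creates an untraceable node or edge.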
Configuration over convention
A centralized ConfigManager handles all settings, with environment variable overrides. No magic defaults: all behavior is explicit and overridable. This suits multi-environment deployments where dev, staging, and production need different backends.
Performance Characteristics
| Characteristic | Mechanism |
|---|---|
| Parallel execution | Pipeline(workers=N) with configurable workers per stage |
| Delta processing | Incremental graph updates: no full recompute on new data |
| Streaming ingestion | Process large corpora without loading everything into memory |
| Backend flexibility | Swap in-memory NetworkX for Neo4j / FalkorDB with no API changes |
| Deduplication v2 | blocking_v2, hybrid_v2, semantic_v2: up to 7x faster than v1 |
| Indexed search | Explorer search at 0.004ms on 118k nodes (v0.5.0) |
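To make the delta processing row concrete, here is a minimal sketch of one common way to avoid a full recompute: fingerprint each document and reprocess only those whose fingerprint changed. This is a generic content-hashing technique chosen for illustration; the `DeltaTracker` class is hypothetical and nothing here is claimed to match how the framework detects changes internally.

```python
import hashlib
from typing import Dict


class DeltaTracker:
    """Remembers a SHA-256 digest per document and reports only what changed."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # doc_id -> hex digest of last-seen text

    def changed(self, docs: Dict[str, str]) -> Dict[str, str]:
        """Return the subset of docs that are new or whose content differs."""
        delta: Dict[str, str] = {}
        for doc_id, text in docs.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self._seen.get(doc_id) != digest:
                delta[doc_id] = text
                self._seen[doc_id] = digest
        return delta


tracker = DeltaTracker()

# First run: nothing has been seen, so every document is part of the delta.
first = tracker.changed({"a": "alpha", "b": "beta"})

# Second run: "a" is unchanged, "b" was edited, "c" is new.
second = tracker.changed({"a": "alpha", "b": "beta v2", "c": "gamma"})
```

Only the documents in `second` would need to flow through extraction and graph update on the second run; this sketch does not handle deletions, which a real incremental pipeline would also have to track.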
Modules
Full module documentation with code examples.
Learning More
Configuration reference, performance guide, and troubleshooting.
Pipeline Reference
Pipeline orchestration, workers, and retry policies.
Core Reference
Framework lifecycle, plugin registry, and configuration.
