semantica.ingest is the universal entry point for loading data into Semantica:
  • 15+ ingestion adapters: files, web, SQL, Snowflake, Kafka, MCP, Git repos, email
  • PyArrow Parquet with column selection and partitioned dataset support
  • XXE-safe lxml XML with optional XSD schema validation
  • ingest() unified dispatcher: auto-detects source type from path or URL
  • Each ingestor returns its own typed object (FileObject, WebContent, TableData, etc.)

Exported Classes

| Class | Role |
| --- | --- |
| FileIngestor | PDF, DOCX, HTML, JSON, CSV, Excel, PPTX, ZIP/TAR: type auto-detected from extension |
| CloudStorageIngestor | Unified client for AWS S3, Google Cloud Storage, and Azure Blob Storage |
| WebIngestor | Web scraping and crawling with ingest_url, crawl_sitemap, crawl_domain |
| RESTIngestor | Generic REST API ingestion with headers, params, retries, and pagination |
| PublicAPIIngestor | No-auth public API ingestion with pre-configured examples and rate limiting |
| FeedIngestor | RSS/Atom feed ingestion with live monitoring via FeedMonitor |
| StreamIngestor | Real-time ingestion from Kafka, RabbitMQ, AWS Kinesis, and Apache Pulsar |
| RepoIngestor | Git repositories: source files, commit history, and metadata |
| DBIngestor | SQL databases via SQLAlchemy: tables, views, and custom queries |
| SnowflakeIngestor | Snowflake data warehouse queries and table exports |
| ParquetIngestor | Apache Parquet files and partitioned datasets with column selection |
| XMLIngestor | XXE-safe XML parsing with optional XSD schema validation |
| EmailIngestor | IMAP/POP3 email ingestion with attachment extraction |
| OntologyIngestor | OWL/RDF/Turtle ontology file ingestion |
| MCPIngestor | Model Context Protocol (MCP) resource ingestion |
| ingest() | Unified dispatcher: detects source type automatically from path or URL |

Getting Started

Use FileIngestor for local files: it auto-detects format from the file extension and handles archives:
from semantica.ingest import FileIngestor

ingestor = FileIngestor()

# Single file -> FileObject
file_obj = ingestor.ingest_file("data/report.pdf")
print(file_obj.name)       # "report.pdf"
print(file_obj.file_type)  # "pdf"
print(file_obj.text)       # decoded text content (property on FileObject)
print(file_obj.size)       # bytes

# Directory scan -> List[FileObject]
files = ingestor.ingest_directory("data/", recursive=True)
for f in files:
    print(f.name, f.file_type, f.size)
FileIngestor is the most direct path for local files. It detects the format from the file extension, handles ZIP/TAR archives automatically, and exposes the content as raw bytes (.content) or as decoded text (the .text property). Pass read_content=False when you only need file metadata.
For web, database, or stream sources, each ingestor exposes its own typed method:
# Web
from semantica.ingest import WebIngestor
wc = WebIngestor(delay=1.0, respect_robots=True).ingest_url("https://example.com")
print(wc.title, wc.text)

# Database: constructor takes no required args; pass connection_string to methods
from semantica.ingest import DBIngestor
db = DBIngestor()
result = db.ingest_database("postgresql://user:pass@localhost/db")
# result["tables"]["documents"]["rows"] contains the rows

# Unified dispatcher: auto-detects source type
from semantica.ingest import ingest
result = ingest("data/report.pdf")          # -> {"files": [FileObject]}
result = ingest("https://example.com")      # -> {"content": WebContent}
result = ingest("data/events.parquet")      # -> {"data": ParquetData}
result = ingest("ontology.ttl")             # -> {"ontology": OntologyData}

Quick Start

1. Ingest local files

from semantica.ingest import FileIngestor

ingestor = FileIngestor()

# Single file: type auto-detected from extension
file_obj = ingestor.ingest_file("data/report.pdf")

# Recursive directory scan
files = ingestor.ingest_directory("data/", recursive=True)

# ingest() also works: routes to file or directory automatically
from semantica.ingest import ingest
result = ingest("data/report.pdf")   # {"files": [FileObject]}

2. Connect to a database

from semantica.ingest import DBIngestor

ingestor = DBIngestor()

# Ingest only the tables listed in include_tables
result = ingestor.ingest_database(
    "postgresql://user:pass@localhost/db",
    include_tables=["documents"],
)
# result["tables"]["documents"]["rows"] contains the row dicts

# Run a custom query
rows = ingestor.execute_query(
    "postgresql://user:pass@localhost/db",
    "SELECT id, content, created_at FROM documents WHERE status = :s",
    s="active",
)
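The :s placeholder in the query above is a bound parameter: the value is sent separately from the SQL text instead of being spliced into it. The sketch below illustrates that behavior with the standard-library sqlite3 module, which accepts the same :name style. It is not DBIngestor itself; the in-memory table and the active_documents helper are hypothetical stand-ins for illustration.

```python
import sqlite3

# In-memory stand-in for the "documents" table queried above (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE documents (id INTEGER, content TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(1, "alpha", "active"), (2, "beta", "archived"), (3, "gamma", "active")],
)


def active_documents(connection, status):
    """Return rows whose status equals the bound value."""
    # The value travels separately from the SQL text, so it is compared as
    # data and never interpreted as SQL, whatever characters it contains.
    cursor = connection.execute(
        "SELECT id, content FROM documents WHERE status = :s ORDER BY id",
        {"s": status},
    )
    return [dict(row) for row in cursor.fetchall()]


rows = active_documents(conn, "active")
# A value that would alter the query if it were concatenated into the string
# simply matches no rows when it is bound.
hostile = active_documents(conn, "active' OR '1'='1")
```

Binding values this way is generally preferable to formatting them into the query string, especially when the value comes from user input.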

3. Feed into the pipeline

from semantica.ingest import FileIngestor
from semantica.pipeline import PipelineBuilder, ExecutionEngine
from semantica.parse import DocumentParser
from semantica.semantic_extract import NERExtractor

ingestor  = FileIngestor()
parser    = DocumentParser()
extractor = NERExtractor(method="ml")

builder = PipelineBuilder()
builder.add_step("ingest",  "file_ingest",    handler=ingestor.ingest_file)
builder.add_step("parse",   "document_parse", handler=parser.parse)
builder.add_step("extract", "ner_extract",    handler=extractor.extract)
builder.connect_steps("ingest", "parse")
builder.connect_steps("parse",  "extract")

pipeline = builder.build("my_pipeline")
result   = ExecutionEngine().execute_pipeline(pipeline, data="data/report.pdf")
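To make the data flow in the pipeline above concrete, here is a minimal sketch of a linear step runner. It assumes that each step's return value becomes the input of the step it is connected to, which the example implies but does not spell out. TinyPipeline and the toy handlers are hypothetical and are not part of semantica.pipeline.

```python
from typing import Any, Callable, Dict, Optional


class TinyPipeline:
    """Hypothetical stand-in: runs a single linear chain of steps."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[Any], Any]] = {}
        self.next_step: Dict[str, str] = {}

    def add_step(self, name: str, handler: Callable[[Any], Any]) -> None:
        self.handlers[name] = handler

    def connect_steps(self, source: str, target: str) -> None:
        self.next_step[source] = target

    def run(self, data: Any) -> Any:
        # The entry step is the only one that is never a connection target.
        targets = set(self.next_step.values())
        entry = [name for name in self.handlers if name not in targets]
        if len(entry) != 1:
            raise ValueError("expected exactly one entry step")
        current: Optional[str] = entry[0]
        while current is not None:
            # Each handler receives the previous handler's return value.
            data = self.handlers[current](data)
            current = self.next_step.get(current)
        return data


pipeline = TinyPipeline()
pipeline.add_step("ingest", lambda path: f"contents of {path}")  # toy ingest
pipeline.add_step("parse", str.upper)                            # toy parse
pipeline.add_step("extract", str.split)                          # toy extract
pipeline.connect_steps("ingest", "parse")
pipeline.connect_steps("parse", "extract")
tokens = pipeline.run("data/report.pdf")
```

The sketch handles only a single unbranched chain; branching, parallel steps, and error handling are left out.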

Ingestors

FileIngestor

from semantica.ingest import FileIngestor

ingestor = FileIngestor()

# Single file
file_obj = ingestor.ingest_file("data/report.pdf")

# Directory: returns List[FileObject]
files = ingestor.ingest_directory("data/", recursive=True)

# ingest() dispatches to ingest_file or ingest_directory automatically
files = ingestor.ingest("data/")
Supported formats: PDF, DOCX, TXT, HTML, JSON, CSV, Excel (XLSX/XLS), PPTX, ZIP/TAR archives.
Glob patterns (e.g. "data/**/*.docx") are not supported. ingest() accepts a file path or a directory path only. To filter by extension inside a directory, use ingest_directory() with the pattern= filter option.
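As a rough illustration of what filtering by pattern inside a directory means, the sketch below walks a directory with the standard library and keeps only files whose names match a pattern. The list_matching helper is hypothetical and says nothing about how ingest_directory() is implemented; it only shows the difference between a directory path plus a name filter and a glob passed as the path.

```python
import fnmatch
from pathlib import Path
from typing import List


def list_matching(root: str, pattern: str = "*", recursive: bool = True) -> List[Path]:
    """Return files under root whose file name matches pattern."""
    base = Path(root)
    # rglob walks subdirectories; glob stays at the top level.
    candidates = base.rglob("*") if recursive else base.glob("*")
    return sorted(
        path
        for path in candidates
        if path.is_file() and fnmatch.fnmatch(path.name, pattern)
    )
```

For example, list_matching("data/", "*.docx") would return every .docx file below data/, while recursive=False restricts the search to the top-level directory.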

ParquetIngestor

PyArrow-based ingestion for Apache Parquet files, including Hive-style partitioned datasets:
from semantica.ingest import ParquetIngestor

ingestor = ParquetIngestor()

# Single Parquet file -> ParquetData
data = ingestor.ingest_file("data/events.parquet")

# Partitioned directory (year=2024/month=01/...)
data = ingestor.ingest_directory("data/partitioned/")

# Load only specific columns: pass as kwarg
from semantica.ingest import ingest_parquet
data = ingest_parquet("data/events.parquet", columns=["id", "text", "timestamp"])

# Extract schema without loading data
schema = ingest_parquet("data/events.parquet", method="schema")
Requires pyarrow: pip install pyarrow.
Use ParquetIngestor instead of FileIngestor for structured analytical data. Parquet ingestion preserves column types (int, float, datetime) that CSV reading loses. Use columns=["id", "text"] to avoid loading unused columns: critical for wide tables with hundreds of columns.
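In a Hive-style partitioned dataset, partition values live in the directory names rather than in the files. The sketch below shows how those key=value segments can be read back from a path. The partition_values helper is hypothetical and is not part of Semantica or PyArrow; it returns the values as strings, whereas a real reader may infer types.

```python
from pathlib import PurePosixPath
from typing import Dict


def partition_values(path: str) -> Dict[str, str]:
    """Collect key=value directory segments from a Hive-style path."""
    values: Dict[str, str] = {}
    for part in PurePosixPath(path).parts:
        key, separator, value = part.partition("=")
        # Segments without "=" (plain directories, file names) are skipped.
        if separator and key:
            values[key] = value
    return values


example = partition_values("data/partitioned/year=2024/month=01/part-0.parquet")
```

Here example holds the year and month taken from the directory names, which is how a partition-aware reader can expose them as columns even though the Parquet file itself does not store them.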

XMLIngestor

XXE-safe lxml-based ingestion with optional schema validation:
from semantica.ingest import XMLIngestor

# Basic ingestion
ingestor = XMLIngestor()
data = ingestor.ingest_file("data/records.xml")

# With XSD validation: pass schema_path as kwarg
from semantica.ingest import ingest_xml
data = ingest_xml("data/records.xml", schema_path="schema.xsd")

# Validation report only
report = ingest_xml("data/feed.xml", method="validate", schema_path="schema.xsd")

# Directory scan
results = ingestor.ingest_directory("data/records/")
XMLIngestor is XXE-safe by default: it parses with lxml and resolve_entities=False, which prevents XML External Entity (XXE) injection when handling untrusted XML. Do not pre-parse untrusted XML with the standard xml.etree.ElementTree before passing it to Semantica; the Python documentation warns that the module is not secure against maliciously constructed data.

ingest() Unified Dispatcher

ingest() auto-detects source type from the path or URL and routes to the appropriate ingestor. It returns a Dict[str, Any] where the key depends on source type:
from semantica.ingest import ingest

# File
result = ingest("report.pdf")               # {"files": [FileObject]}
result = ingest("data/", source_type="file") # {"files": [FileObject, ...]}

# Web
result = ingest("https://example.com")       # {"content": WebContent}

# Feed (auto-detected from URL pattern)
result = ingest("https://example.com/feed.xml") # {"feeds": FeedData}

# Parquet (auto-detected from .parquet extension)
result = ingest("events.parquet")            # {"data": ParquetData}

# XML (auto-detected from .xml extension)
result = ingest("records.xml")               # {"xml": XMLIngestionData}

# Ontology (auto-detected from .ttl/.owl/.rdf)
result = ingest("ontology.ttl")              # {"ontology": OntologyData}

# Database (auto-detected from connection string prefix)
result = ingest("postgresql://user:pass@localhost/db") # {"data": ...}

# Public API
result = ingest(
    "https://jsonplaceholder.typicode.com/posts",
    source_type="public_api",
)                                            # {"data": APIData}
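The auto-detection shown above can be pictured as a small set of rules over the URL scheme and the file extension. The following is a minimal sketch under stated assumptions, not Semantica's actual dispatch logic: guess_source_type, the list of database schemes, and the feed heuristics are all hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed mappings, for illustration only.
DB_SCHEMES = {"postgresql", "mysql", "sqlite"}
EXTENSION_TYPES = {
    ".parquet": "parquet",
    ".xml": "xml",
    ".ttl": "ontology",
    ".owl": "ontology",
    ".rdf": "ontology",
}


def guess_source_type(source: str) -> str:
    """Guess a source_type string from a path, URL, or connection string."""
    parsed = urlparse(source)
    scheme = parsed.scheme.lower()
    # "postgresql+psycopg2://..." style schemes carry a driver suffix.
    if scheme.split("+")[0] in DB_SCHEMES:
        return "db"
    if scheme in ("http", "https"):
        path = parsed.path.lower()
        if path.endswith((".rss", ".atom", "feed.xml")) or "/feed" in path:
            return "feed"
        return "web"
    # Anything else is treated as a local path and classified by extension.
    suffix = PurePosixPath(source).suffix.lower()
    return EXTENSION_TYPES.get(suffix, "file")
```

Rules like these cannot tell a public JSON API from an ordinary web page, which is consistent with the example above passing source_type="public_api" explicitly.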

ingest() Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sources | str, Path, or List | required | File path, URL, directory, or connection string |
| source_type | str | None (auto-detected) | "file", "web", "public_api", "feed", "stream", "repo", "email", "db", "parquet", "xml", "ontology", "mcp" |
| method | str | None | Optional method override passed to the underlying ingestor |
| **kwargs | | | Extra options forwarded to the underlying ingestor method |

FileObject Fields

FileIngestor returns FileObject instances:
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional

@dataclass
class FileObject:
    path:        str                    # absolute file path
    name:        str                    # filename (e.g. "report.pdf")
    size:        int                    # size in bytes
    file_type:   str                    # detected type without dot (e.g. "pdf", "docx")
    mime_type:   Optional[str]          # MIME type if detectable
    content:     Optional[bytes]        # raw bytes (None if read_content=False)
    metadata:    Dict[str, Any]         # extension, parent dir, is_supported, etc.
    ingested_at: datetime               # ingestion timestamp

    @property
    def text(self) -> str:
        """Decoded text from content bytes (UTF-8 with latin-1 fallback)."""
        ...
To get text from an ingested file, use the .text property:
file_obj = FileIngestor().ingest_file("report.pdf")
text = file_obj.text       # decoded string
raw  = file_obj.content    # raw bytes
Skip reading content (useful for directory scanning without loading files):
files = FileIngestor().ingest_directory("data/", recursive=True, read_content=False)
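The .text docstring describes a UTF-8 decode with a latin-1 fallback. A minimal sketch of that strategy is shown below. The decode_text function is hypothetical, and returning an empty string when content is None (as with read_content=False) is an assumption, since the source does not say what .text yields in that case.

```python
from typing import Optional


def decode_text(content: Optional[bytes]) -> str:
    """Decode bytes as UTF-8, falling back to latin-1 on invalid input."""
    if content is None:
        # Assumption: no content was read, so there is no text to return.
        return ""
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never raises.
        return content.decode("latin-1")
```

Because latin-1 accepts any byte sequence, the fallback always produces a string, though the result is only meaningful when the file really was encoded that way.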

OntologyIngestor

Ingest existing OWL or RDF ontology files as structured knowledge sources:
from semantica.ingest import OntologyIngestor

ingestor = OntologyIngestor()

# The format argument should match the file's serialization
data = ingestor.ingest_ontology("domain_ontology.ttl", format="turtle")

# Or using the convenience function
from semantica.ingest import ingest_ontology
data = ingest_ontology("domain_ontology.ttl")

Custom Ingestors

Register a custom ingestor function with the method registry so that it can be selected by name through the convenience functions:
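from datetime import datetime
from pathlib import Path

from semantica.ingest.registry import method_registry
from semantica.ingest import FileObject

def my_ingestor(source, **kwargs):
    # Build a FileObject from whatever your format produces
    content = b"..."
    return FileObject(
        path=source,
        name=Path(source).name,      # filename only, as in FileObject.name
        size=len(content),
        file_type="custom",
        mime_type=None,
        content=content,
        metadata={},
        ingested_at=datetime.now(),
    )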

method_registry.register("file", "my_format", my_ingestor)

# Now callable via the convenience function
from semantica.ingest import ingest_file
result = ingest_file("source_path", method="my_format")
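Conceptually, a registry like this maps a (task, method name) pair to a callable, and the convenience function looks the callable up before calling it. The sketch below illustrates that idea only. ToyRegistry, toy_registry, and toy_ingest_file are hypothetical names, and nothing here reflects how semantica.ingest.registry is actually implemented.

```python
from typing import Any, Callable, Dict, Tuple


class ToyRegistry:
    """Hypothetical registry keyed by (task, method name)."""

    def __init__(self) -> None:
        self._methods: Dict[Tuple[str, str], Callable[..., Any]] = {}

    def register(self, task: str, name: str, func: Callable[..., Any]) -> None:
        self._methods[(task, name)] = func

    def get(self, task: str, name: str) -> Callable[..., Any]:
        try:
            return self._methods[(task, name)]
        except KeyError:
            raise KeyError(f"no method {name!r} registered for task {task!r}") from None


toy_registry = ToyRegistry()


def toy_ingest_file(source: str, method: str, **kwargs: Any) -> Any:
    # Look up the registered callable by name, then delegate to it.
    return toy_registry.get("file", method)(source, **kwargs)


def my_format_reader(source: str, **kwargs: Any) -> Dict[str, Any]:
    return {"path": source, "file_type": "custom", "options": kwargs}


toy_registry.register("file", "my_format", my_format_reader)
result = toy_ingest_file("source_path", method="my_format", strict=True)
```

Keying on the task as well as the method name lets the same method name be reused across different ingestion tasks without collisions.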

Parse

Parse raw sources into structured text and tables.

Pipeline

Orchestrate ingest as the first pipeline step.

Snowflake Integration

Snowflake-specific setup and authentication guide.

Provenance

Track lineage from ingest through to inference.