semantica.parse extracts structured text, layout, tables, and metadata from unstructured documents:
  • DocumentParser: broad format support (PDF, DOCX, HTML, TXT, JSON, CSV, PPTX, XLSX), no extra dependencies
  • DoclingParser: complex layouts, merged-cell tables, multi-column PDFs, OCR (pip install docling)
  • Both return a consistent dict with full_text, metadata, pages, and tables keys
  • parse_batch() processes multiple files in parallel with configurable error handling

Getting Started

Installation

The parse module works out of the box for standard formats:
from semantica.parse import DocumentParser

parser = DocumentParser()
result = parser.parse("document.pdf")
print(result["full_text"])  # Extracted text content
For enhanced table extraction and complex layouts, install the Docling dependency:
pip install docling
from semantica.parse import DoclingParser

parser = DoclingParser(export_format="markdown")
result = parser.parse("document.pdf", extract_tables=True)
print(result["tables"])  # Enhanced table extraction

First Document Parsing

from semantica.parse import DocumentParser

# Parse any supported format
parser = DocumentParser()
result = parser.parse("annual_report.pdf")

# Access extracted content
text = result["full_text"]           # Complete document text
metadata = result["metadata"]        # Document properties
pages = result.get("pages", [])      # Page-level content

print(f"Extracted {len(text)} characters from {metadata.get('page_count', 0)} pages")

Parser Selection Guide

Zero extra dependencies. Use for clean PDFs, Word docs, HTML, and structured formats.
Formats: PDF, DOCX, HTML, TXT, JSON, CSV, PPTX, XLSX
Speed: Fast
Setup: None (included in base install)
Best for: Clean documents, broad format support, production pipelines
from semantica.parse import DocumentParser

parser = DocumentParser()
result = parser.parse("contract.pdf")

print(result["full_text"])        # Extracted text
print(result["metadata"])         # Title, author, page count, ...
print(len(result.get("pages", [])))  # Per-page breakdown

Exported Classes

| Class | Role |
| --- | --- |
| DocumentParser | Auto-detects format; delegates to format-specific parser (PDF, DOCX, HTML, JSON, CSV, …) |
| DoclingParser | Complex layouts, merged-cell tables, multi-column PDFs, and OCR (pip install docling) |
| DoclingMetadata | Document metadata from Docling parsing |
| PDFParser | PDF text and metadata extraction |
| WebParser | URL fetch + HTML parsing |
| EmailParser | .eml / .msg email files with attachment extraction |
| CodeParser | Source code files with syntax-aware block detection |

DocumentParser

Standard parser for clean, machine-readable documents:
from semantica.parse import DocumentParser

parser = DocumentParser()
result = parser.parse("data/report.pdf")

print(result["full_text"])      # Complete extracted text
print(result["metadata"])       # Document properties (title, author, page_count, etc.)
if "pages" in result:           # Page-level content (when available)
    print(f"Pages: {len(result['pages'])}")
Supported formats: PDF, DOCX, HTML, TXT, JSON, CSV, PPTX, XLSX.

DoclingParser

Advanced parser built on the Docling backend. It handles layouts that DocumentParser cannot:
pip install docling
from semantica.parse import DoclingParser

parser = DoclingParser(
    export_format="markdown",      # Export format: "markdown" | "html" | "json"
    enable_ocr=False               # Enable OCR for scanned documents
)

result = parser.parse(
    "data/annual_report.pdf",
    extract_tables=True,           # Extract structured tables
    extract_images=False,          # Extract image regions
    extract_text=True              # Extract text content
)

print(result["full_text"])    # Complete extracted text
print(result["tables"])       # Structured table data
if "pages" in result:         # Page-level content
    print(f"Pages: {len(result['pages'])}")
Use DoclingParser for:
  • Multi-column PDF layouts
  • Tables with merged cells or complex headers
  • PPTX slides with embedded charts
  • XLSX spreadsheets with formulas
  • Scanned documents with OCR
  • Academic papers and technical reports
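
The guidance above can be sketched as a small decision rule. This is only an illustration: `choose_parser` and the trait names are hypothetical and not part of semantica; they restate the list above as code so a pipeline can pick a parser name up front.

```python
from typing import Iterable

# Hypothetical trait labels that mirror the "Use DoclingParser for" list.
DOCLING_TRAITS = frozenset({
    "multi_column",      # multi-column PDF layouts
    "merged_cells",      # tables with merged cells or complex headers
    "embedded_charts",   # PPTX slides with embedded charts
    "formulas",          # XLSX spreadsheets with formulas
    "scanned",           # scanned documents that need OCR
    "academic",          # academic papers and technical reports
})


def choose_parser(traits: Iterable[str] = ()) -> str:
    """Return the parser class name suggested for a document with these traits."""
    if DOCLING_TRAITS.intersection(traits):
        return "DoclingParser"
    # Clean, machine-readable documents stay on the dependency-free parser.
    return "DocumentParser"
```

Returning a class name rather than an instance keeps the sketch independent of whether docling is installed.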

OCR Support

parser = DoclingParser(
    enable_ocr=True,           # Enable OCR via PdfPipelineOptions
    export_format="markdown"
)

result = parser.parse("data/scanned_contract.pdf")
print(result["full_text"])     # OCR-extracted text
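
One way to decide when OCR is worth enabling is to check how little text a first, non-OCR parse returned. The sketch below works only on the result dictionary described on this page; `looks_scanned` and the 50-characters-per-page threshold are assumptions chosen for illustration, not part of semantica.

```python
def looks_scanned(result: dict, min_chars_per_page: int = 50) -> bool:
    """Heuristic: very little extracted text per page suggests a scanned document."""
    text = (result.get("full_text") or "").strip()
    # Use whichever page count the result provides; fall back to a single page.
    pages = (
        result.get("total_pages")
        or result.get("metadata", {}).get("page_count")
        or len(result.get("pages", []))
        or 1
    )
    return len(text) / pages < min_chars_per_page
```

If this returns True, a second pass with DoclingParser(enable_ocr=True) is a reasonable next step; the threshold should be tuned on real documents.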

Supported Formats

| Format | Extension | Parser Used | Notes |
| --- | --- | --- | --- |
| PDF | .pdf | PDFParser / DoclingParser | Text, tables, metadata; Docling adds OCR |
| Word | .docx | Built-in | Text, headings, tables, metadata |
| HTML | .html, .htm | HTMLParser / WebParser | WebParser fetches remote URLs |
| Markdown | .md | Built-in | Preserves heading hierarchy |
| Plain text | .txt | TXTParser | Minimal metadata |
| JSON | .json | JSONParser | One object per line or array |
| CSV / TSV | .csv, .tsv | CSVParser | Header auto-detected |
| Excel | .xlsx, .xls | Built-in | Sheet selection supported |
| PowerPoint | .pptx | Built-in | DoclingParser for embedded charts |
| Email | .eml, .msg | EmailParser | Attachments extracted |
| XML | .xml | XMLIngestor | XXE-safe, optional XSD validation |
| Archive | .zip, .tar | FileIngestor | Recursive extraction |
| Source code | .py, .js, .java, … | CodeParser | AST-aware block detection |
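
For pre-filtering a directory before parsing, the table above can be mirrored as a plain lookup. This is a sketch, not semantica's own detection logic: `PARSER_BY_EXTENSION` and `parser_for` are hypothetical names, and the mapping only contains the extensions listed in the table.

```python
from pathlib import Path
from typing import Optional

# Extension -> "Parser Used" column, copied from the table above.
PARSER_BY_EXTENSION = {
    ".pdf": "PDFParser / DoclingParser",
    ".docx": "Built-in",
    ".html": "HTMLParser / WebParser",
    ".htm": "HTMLParser / WebParser",
    ".md": "Built-in",
    ".txt": "TXTParser",
    ".json": "JSONParser",
    ".csv": "CSVParser",
    ".tsv": "CSVParser",
    ".xlsx": "Built-in",
    ".xls": "Built-in",
    ".pptx": "Built-in",
    ".eml": "EmailParser",
    ".msg": "EmailParser",
    ".xml": "XMLIngestor",
    ".zip": "FileIngestor",
    ".tar": "FileIngestor",
    ".py": "CodeParser",
    ".js": "CodeParser",
    ".java": "CodeParser",
}


def parser_for(path: str) -> Optional[str]:
    """Return the table entry for this file's extension, or None if it is not listed."""
    return PARSER_BY_EXTENSION.get(Path(path).suffix.lower())
```

The lookup is case-insensitive on the extension and returns None for anything the table does not list, which makes unsupported files easy to report before a batch run.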

Parser Output Structure

Both parsers return dictionaries with the following structure:
from typing import List

# Schema sketch: each value is the expected type of that key
result = {
    "full_text": str,              # Complete extracted text
    "metadata": dict,              # Document properties and statistics
    "pages": List[dict],           # Page-level content (when available)
    "tables": List[dict],          # Structured table data (DoclingParser)
    "images": List[dict],          # Image regions (DoclingParser)
    "total_pages": int,            # Total page count
    "export_format": str           # Format used for text extraction (DoclingParser)
}
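
Because several keys are only present for some parsers or documents, downstream code is safer when it treats them as optional. The following sketch assumes the key names shown above; `ParseResult` and `summarize` are illustrative names, not semantica exports.

```python
from typing import List, TypedDict


class ParseResult(TypedDict, total=False):
    """Typed view of the result dictionary; every key is treated as optional."""
    full_text: str
    metadata: dict
    pages: List[dict]
    tables: List[dict]
    images: List[dict]
    total_pages: int
    export_format: str


def summarize(result: ParseResult) -> dict:
    """Count what was extracted without assuming any optional key exists."""
    pages = result.get("pages", [])
    return {
        "characters": len(result.get("full_text", "")),
        "pages": result.get("total_pages", len(pages)),
        "tables": len(result.get("tables", [])),
        "images": len(result.get("images", [])),
    }
```

With `total=False`, a type checker accepts results from either parser, and `summarize` gives the same shape of report for both.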

Metadata Structure

metadata = {
    "file_path": str,              # Source file path
    "page_count": int,             # Number of pages
    "format": str,                 # File format ("pdf", "docx", etc.)
    # Additional fields vary by parser and document type
}

DocumentParser Methods

| Method | Returns | Description |
| --- | --- | --- |
| parse(source) | dict | Auto-detect format and extract text, metadata, tables |
| parse_batch(sources) | dict | Process multiple sources in parallel |
| extract_text(path) | str | Extract only text content from document |
| extract_metadata(path) | dict | Extract only metadata from document |
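
The parameters of parse_batch are not listed here, so the sketch below does not call it. Instead it illustrates the general idea of parallel parsing with a configurable error policy around any single-file parse callable. `parse_many`, `on_error`, and `max_workers` are hypothetical names, and this is not a description of how semantica implements parse_batch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def parse_many(
    parse: Callable[[str], dict],
    sources: Iterable[str],
    on_error: str = "skip",   # "skip" records the failure, "raise" re-raises it
    max_workers: int = 4,
) -> Dict[str, dict]:
    """Run `parse` over all sources in a thread pool and collect results and errors."""
    if on_error not in ("skip", "raise"):
        raise ValueError("on_error must be 'skip' or 'raise'")
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so files are parsed concurrently.
        futures = {src: pool.submit(parse, src) for src in list(sources)}
        for src, future in futures.items():
            try:
                results[src] = future.result()
            except Exception as exc:
                if on_error == "raise":
                    raise
                errors[src] = str(exc)
    return {"results": results, "errors": errors}
```

With semantica this would be called as parse_many(parser.parse, paths), where parser is a DocumentParser or DoclingParser instance. Keeping failures in a separate "errors" mapping lets one unreadable file be reported without discarding the rest of the batch.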

Integration with FileIngestor

The most common pattern is to ingest a directory and then parse each source:
from semantica.ingest import FileIngestor
from semantica.parse import DoclingParser

ingestor = FileIngestor()
parser   = DoclingParser(export_format="markdown")

sources = ingestor.ingest("data/reports/")
for source in sources:
    result = parser.parse(source)
    # Access extracted content
    text = result["full_text"]
    tables = result["tables"] 
    metadata = result["metadata"]
Docling is an optional dependency. If docling is not installed, DoclingParser raises an ImportError with installation instructions: pip install docling. DocumentParser is always available and requires no extras.
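
Since the ImportError is the documented signal that docling is missing, a pipeline can fall back to DocumentParser instead of failing. The sketch below shows that pattern generically; `first_available` is a hypothetical helper, and the semantica usage appears only in comments because the exact point at which the ImportError is raised (import or construction) is not stated here.

```python
from typing import Callable, Optional, Sequence


def first_available(factories: Sequence[Callable[[], object]]) -> object:
    """Return the first object whose factory does not raise ImportError."""
    last_error: Optional[ImportError] = None
    for factory in factories:
        try:
            return factory()
        except ImportError as exc:
            # Optional dependency missing: remember why and try the next factory.
            last_error = exc
    raise ImportError("no parser backend available") from last_error


# Intended use with semantica (not executed here):
#
#   def docling():
#       from semantica.parse import DoclingParser
#       return DoclingParser(export_format="markdown")
#
#   def standard():
#       from semantica.parse import DocumentParser
#       return DocumentParser()
#
#   parser = first_available([docling, standard])
```

Putting both the import and the constructor call inside each factory means the fallback works whichever of the two raises.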

Ingest

Load files before parsing.

Split

Chunk parsed text for embedding and extraction.

Docling Integration

Full Docling integration setup guide.

Semantic Extract

Extract entities and relations from parsed text.