Parse complex documents (PDF, DOCX, PPTX, HTML) with high-fidelity table extraction and built-in OCR.
Overview
Docling is integrated into Semantica’s parse module via DoclingParser. Documents pass through Docling’s layout engine, then feed directly into Semantica’s extraction and knowledge graph (KG) pipeline.
Multi-format
PDF, DOCX, PPTX, HTML, and more.
Table Extraction
High-fidelity table parsing with header detection.
OCR Support
Built-in OCR for scanned documents.
Markdown Export
Clean Markdown output optimized for LLM consumption.
Installation
Basic Usage
Full Example
DoclingParser Parameters
| Parameter | Default | Description |
|---|---|---|
| `enable_ocr` | `False` | Enable OCR for scanned pages |
| `export_format` | `"markdown"` | Output format: `"markdown"` or `"text"` |
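The two options in the table can be checked before a parser is built. The following is a minimal sketch, not the library's own code: the helper names `docling_parser_options` and `make_parser` are hypothetical, and the import path `semantica.parse` is an assumption based on the overview's mention of the parse module. Only the parameter names, defaults, and allowed values come from the table.

```python
from typing import Any, Dict

# Allowed values as listed in the parameter table.
ALLOWED_EXPORT_FORMATS = ("markdown", "text")


def docling_parser_options(enable_ocr: bool = False,
                           export_format: str = "markdown") -> Dict[str, Any]:
    """Validate the two documented options and return them as keyword arguments.

    Hypothetical helper: it is not part of Semantica or Docling.
    """
    if export_format not in ALLOWED_EXPORT_FORMATS:
        raise ValueError(
            f"export_format must be one of {ALLOWED_EXPORT_FORMATS}, "
            f"got {export_format!r}"
        )
    return {"enable_ocr": bool(enable_ocr), "export_format": export_format}


def make_parser(**options: Any):
    """Build a DoclingParser from validated options.

    Assumption: DoclingParser is importable from semantica.parse and accepts
    the two keyword arguments from the table. The import is deferred so the
    validation above can be used without Semantica installed.
    """
    from semantica.parse import DoclingParser  # assumed import path
    return DoclingParser(**docling_parser_options(**options))


# Defaults mirror the table; scanned input would usually turn OCR on.
default_options = docling_parser_options()
scanned_options = docling_parser_options(enable_ocr=True, export_format="text")
```

Validating `export_format` up front surfaces a typo such as `"html"` as a `ValueError` at configuration time instead of partway through a parse run.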
Parsed Result Structure
See Also
Parse Module
Full DocumentParser and DoclingParser reference.
Ingest Module
Loading documents before parsing.
Semantic Extract
NER and relation extraction on parsed text.
Pipeline
Using DoclingParser in a full pipeline.
