semantica.split breaks documents into chunks that preserve semantic context:
- Six chunking strategies: recursive, semantic, entity-aware, relation-aware, sliding window, structural
- `SemanticChunker` uses embedding-based topic-shift detection to split only when content changes
- `EntityAwareChunker` keeps entity mentions intact across chunk boundaries
- `RelationAwareChunker` keeps subject-predicate-object triplets within a single chunk
- Chunking quality directly determines downstream embedding accuracy and entity extraction quality
Why Chunking Matters
Most LLMs and embedding models have fixed context windows. Documents larger than that window must be split. But naive splitting (every 500 characters, regardless of structure) destroys semantic context:
- An entity mention like “Apple Inc.” split across two chunks loses its context in both
- A relation triplet like “Steve Jobs founded Apple” split at “Steve Jobs” leaves a dangling subject
- Embedding a chunk that mixes two unrelated topics produces a centroid vector that matches neither
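
The first failure mode is easy to reproduce without any library. The sketch below uses a hypothetical helper, `naive_split`, that cuts at a fixed character count; the sample sentence and the size of 24 are chosen only to make the cut land inside the entity.

```python
def naive_split(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring words and sentences."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Steve Jobs founded Apple Inc. in 1976."
chunks = naive_split(text, 24)
for chunk in chunks:
    print(repr(chunk))

# The entity no longer appears whole in any single chunk.
entity = "Apple Inc."
entity_intact = any(entity in chunk for chunk in chunks)
print(entity_intact)
```

The cut falls between "Apple" and "Inc.", so neither chunk contains the full mention.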
Exported Classes
| Class | Role |
|---|---|
| `TextSplitter` | Unified entry point: swap `method=` without changing downstream code |
| `Chunk` | `{text, start_index, end_index, metadata, id}` |
| `SemanticChunker` | Embedding-based topic-shift detection: splits only when content actually changes |
| `StructuralChunker` | Heading/section-based splits using structural text analysis |
| `EntityAwareChunker` | Prevents named entity mentions from being split across chunk boundaries |
| `RelationAwareChunker` | Keeps subject-predicate-object triplets intact within a single chunk |
| `HierarchicalChunker` | Multi-level chunking producing parent/child chunk relationships |
method= values for TextSplitter:
| Method | Best for |
|---|---|
| `recursive` | General text: splits on paragraphs, sentences, words in order |
| `sentence` | Conversational text, QA |
| `paragraph` | Long-form text where paragraph integrity matters |
| `token` | LLM context window enforcement |
| `semantic_transformer` | Long documents with topic shifts |
| `entity_aware` | KG extraction pipelines |
| `relation_aware` | KG pipelines where triplet integrity matters |
| `structural` | Text with heading/paragraph structure |
| `sliding_window` | Dense overlap for bi-encoder retrieval |
What You Get
TextSplitter
Unified interface for 11 chunking strategies: swap methods without changing downstream code.
Semantic Chunking
Embedding-based topic shift detection: splits only when the topic actually changes.
Entity-Aware Chunking
Entity spans never cross chunk boundaries: guaranteed by boundary adjustment.
Relation-Aware Chunking
Subject–predicate–object triplets kept within a single chunk for KG pipelines.
Chunk Object
Output dataclass with text, character offsets, optional id, and method-specific metadata.
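
The boundary adjustment behind entity-aware chunking can be sketched in a few lines. This is a minimal illustration, not Semantica's implementation: `entity_safe_chunks` and `span_of` are hypothetical names, the entity spans are supplied by hand instead of coming from an NER model, and spans are assumed not to overlap.

```python
def entity_safe_chunks(text, entity_spans, chunk_size):
    """Cut roughly every chunk_size characters without splitting an entity.

    entity_spans holds (start, end) character offsets, end exclusive.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    pos = 0
    while pos < len(text):
        cut = min(pos + chunk_size, len(text))
        for start, end in entity_spans:
            if start < cut < end:  # the proposed cut lands inside an entity
                # Pull the cut back to the entity start; if that would make
                # no progress, push it past the entity end instead.
                cut = start if start > pos else end
                break
        chunks.append(text[pos:cut])
        pos = cut
    return chunks


def span_of(text, mention):
    """Locate the first occurrence of a mention as a (start, end) span."""
    start = text.index(mention)
    return (start, start + len(mention))


text = "Steve Jobs founded Apple Inc. in 1976."
spans = [span_of(text, "Steve Jobs"), span_of(text, "Apple Inc.")]
chunks = entity_safe_chunks(text, spans, chunk_size=24)
print(chunks)
```

With a plain 24-character cut the boundary would fall inside "Apple Inc."; here it moves back to the start of the mention. Note that pushing a cut past a long entity can make a chunk exceed `chunk_size` in this sketch.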
Quick Start
Splitting Methods
| Method | How It Splits | Best For |
|---|---|---|
| `recursive` | Paragraph → sentence → word (cascading fallback) | General-purpose default |
| `semantic_transformer` | Embeds sentences, splits at cosine similarity drops | RAG: topic coherence matters |
| `entity_aware` | Adjusts boundaries so entity spans are never cut | NER pipelines |
| `relation_aware` | Keeps subject–predicate–object triplets within one chunk | KG construction |
| `sentence` | Sentence boundary detection (regex, NLTK, spaCy) | Short documents, Q&A |
| `paragraph` | Paragraph boundary splitting | Long-form articles, reports |
| `token` | Token count via tiktoken or transformers; hard cutoff | LLM context window prep |
| `word` | Word count with overlap | Simple token-approximate splits |
| `character` | Fixed character count with overlap; fastest, no NLP | Simple batch jobs |
| `sliding_window` | Fixed-size window advancing by stride; configurable overlap | Dense retrieval (ColBERT, DPR) |
| `structural` | Heading/paragraph structure detection | Text with explicit heading hierarchy |
| `embedding_semantic` | Embedding similarity boundaries (alias of `semantic_transformer`) | RAG with embedding-based coherence |
| `hierarchical` | Multi-level section → paragraph → sentence chunking | Multi-granularity retrieval |
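
To make "splits at cosine similarity drops" concrete, here is a small self-contained sketch. It is not the library's code: the bag-of-words `embed` function is a toy stand-in for a sentence-transformers model, all function names are hypothetical, and the rule used here (start a new chunk when adjacent-sentence similarity falls below the threshold) is one common convention that may differ from how Semantica applies `similarity_threshold`.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would use a sentence encoder."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_on_topic_shift(sentences: list[str], threshold: float) -> list[list[str]]:
    """Group consecutive sentences, cutting where similarity drops below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])      # similarity dropped: start a new chunk
        else:
            groups[-1].append(cur)    # same topic: extend the current chunk
    return groups


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse on the mat.",
    "Interest rates rose sharply.",
    "Rising interest rates hurt borrowers.",
]
groups = split_on_topic_shift(sentences, threshold=0.3)
for group in groups:
    print(group)
```

The two cat sentences share most of their vocabulary and stay together; the jump to interest rates shares no words with the previous sentence, so a boundary is placed there.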
Choosing a Strategy
Use this decision tree before picking a method:
- Building a KG? → `relation_aware` (keeps triplets intact), then `entity_aware` for pure NER
- RAG system where retrieval quality matters most? → `semantic_transformer`
- Dense overlap for bi-encoder retrieval (ColBERT, DPR)? → `sliding_window`
- Preparing prompts for a fixed-window LLM? → `token`
- Structured text with headings? → `structural`
- Paragraph-level coherence? → `paragraph` or `sentence`
- Fast splitting with no NLP overhead? → `recursive` or `character`
TextSplitter Constructor
| Parameter | Type | Default | Description |
|---|---|---|---|
| `method` | `str \| list[str]` | `"recursive"` | Chunking strategy, or list of methods as fallback chain |
| `chunk_size` | `int` | `1000` | Target size in characters (not tokens: if you were using token-based sizing before, multiply by ~4 to approximate the same boundary) |
| `chunk_overlap` | `int` | `200` | Character overlap between adjacent chunks |
| `similarity_threshold` | `float` | `0.7` | Cosine similarity cutoff for `semantic_transformer`: lower = more splits |
| `model` | `str` | `"all-MiniLM-L6-v2"` | Sentence-transformers model name for `semantic_transformer` |
| `ner_method` | `str` | `"ml"` | NER method for `entity_aware`: `"pattern"`, `"regex"`, `"ml"`, `"huggingface"`, or `"llm"` |
| `relation_method` | `str` | `"ml"` | Relation extraction method for `relation_aware`: `"ml"`, `"llm"`, or `"huggingface"` |
| `tokenizer` | `str` | `"gpt-4"` | tiktoken model name for the `token` method: unrecognised names fall back to `cl100k_base` |
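
How `chunk_size` and `chunk_overlap` interact is easiest to see on character offsets. The sketch below assumes the simplest possible model, a fixed window that advances by `chunk_size - chunk_overlap`; `overlapping_windows` is a hypothetical helper and the library's own methods may place boundaries differently.

```python
def overlapping_windows(length: int, chunk_size: int, chunk_overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) offsets for fixed-size windows over `length` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    stride = chunk_size - chunk_overlap  # how far each window advances
    spans = []
    start = 0
    while start < length:
        end = min(start + chunk_size, length)
        spans.append((start, end))
        if end == length:  # the last window reached the end of the text
            break
        start += stride
    return spans


# The documented defaults applied to a 2500-character document.
spans = overlapping_windows(2500, chunk_size=1000, chunk_overlap=200)
print(spans)
```

Under this model the defaults give a stride of 800 characters, so each chunk repeats the final 200 characters of its predecessor.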
Splitting Method Details
Recursive (default)
Tries paragraph breaks first, then sentence boundaries, then word boundaries, falling back only when the chunk exceeds `chunk_size`.
Key behaviours:
- Preserves paragraph and sentence structure wherever possible
- Falls back gracefully: never produces chunks larger than `chunk_size`
- Overlap ensures context continuity across chunk boundaries
- Good starting point when you’re unsure which method to use
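
The cascading fallback can be sketched as a short recursive function. This is an illustration of the general technique under simplifying assumptions, not Semantica's code: `recursive_split` is a hypothetical name, sentences are approximated by splitting on `". "`, the separator at each cut is dropped, and overlap is omitted.

```python
def recursive_split(text: str, chunk_size: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Last resort: hard cut so no chunk ever exceeds chunk_size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with the next finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is a little longer than the first one."
)
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits within 40 characters and is kept whole; the second does not, contains no sentence break, and is therefore split at a word boundary.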
Chunk Schema
Chunk dataclass
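
A rough shape for the dataclass, reconstructed from the field list `{text, start_index, end_index, metadata, id}` given above. The field types, the defaults, and the choice to treat `end_index` as exclusive are assumptions made for illustration; check the library source for the actual definition.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Chunk:
    text: str                      # the chunk content
    start_index: int               # character offset where the chunk starts
    end_index: int                 # character offset where it ends (assumed exclusive)
    metadata: dict[str, Any] = field(default_factory=dict)  # method-specific keys
    id: Optional[str] = None       # optional identifier


source = "Steve Jobs founded Apple Inc. in 1976."
chunk = Chunk(
    text=source[0:18],
    start_index=0,
    end_index=18,
    metadata={"method": "recursive"},
)

# With exclusive end offsets, the offsets slice back to the chunk text.
print(source[chunk.start_index:chunk.end_index] == chunk.text)
```

Keeping character offsets on every chunk lets downstream steps map extracted entities back to positions in the original document.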
Chunk metadata fields
Metadata keys vary by method. Only keys that are actually set by the implementation are listed.
| Field | Type | Set by | Description |
|---|---|---|---|
| `method` | `str` | all methods | Splitting method that produced this chunk |
| `chunk_size` | `int` | most methods | Character length of this chunk |
| `sentence_count` | `int` | `sentence`, `semantic_transformer`, spaCy path | Number of sentences in this chunk |
| `paragraph_count` | `int` | `paragraph` | Number of paragraphs in this chunk |
| `word_count` | `int` | `word` | Number of words in this chunk |
| `token_count` | `int` | `token`; `sentence`/`semantic_transformer` when spaCy is available | Token count: not always present |
| `entity_count` | `int` | `entity_aware` | Number of entities whose boundaries fall in this chunk |
| `entities` | `list` | `entity_aware` | Entity objects whose boundaries fall in this chunk |
| `relation_count` | `int` | `relation_aware` | Number of relation triplets in this chunk |
| `relationships` | `list` | `relation_aware` | Relation objects in this chunk |
| `element_count` | `int` | `structural` | Number of structural elements grouped into this chunk |
| `element_types` | `list[str]` | `structural` | Types of elements: `"heading"`, `"paragraph"`, `"list"`, etc. |
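
Because the keys vary by method, downstream code is safer reading metadata defensively than indexing it directly. The sketch below uses plain dicts as stand-ins for `Chunk.metadata`; the values are invented for illustration and `available_counts` is a hypothetical helper.

```python
# Count-style keys taken from the table above.
COUNT_KEYS = (
    "sentence_count", "paragraph_count", "word_count", "token_count",
    "entity_count", "relation_count", "element_count",
)


def available_counts(metadata: dict) -> dict:
    """Return only the count fields that this chunk's method actually set."""
    return {key: metadata[key] for key in COUNT_KEYS if key in metadata}


# Illustrative metadata as two different methods might produce it.
token_meta = {"method": "token", "chunk_size": 180, "token_count": 42}
entity_meta = {
    "method": "entity_aware",
    "chunk_size": 240,
    "entity_count": 2,
    "entities": ["Steve Jobs", "Apple Inc."],
}

print(available_counts(token_meta))
print(available_counts(entity_meta))

# .get() avoids a KeyError for keys the method did not set.
print(entity_meta.get("token_count"))
```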
Tokenizer Options
The `token` method accepts a `tokenizer=` kwarg that is passed to `tiktoken.encoding_for_model()`. The value should be a tiktoken model name. Unrecognised names fall back to `cl100k_base` automatically.
| Value | Encoding used |
|---|---|
| `"gpt-4"` (default) | `cl100k_base` |
| `"gpt-3.5-turbo"` | `cl100k_base` |
| `"text-embedding-ada-002"` | `cl100k_base` |
| Any unrecognised string | Falls back to `cl100k_base` |

If tiktoken is not installed, the `token` method falls back to splitting by whitespace-separated words.
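
The two fallbacks described here can be sketched as follows. This shows one plausible way to implement the behaviour, not the library's actual code; `count_words` and `make_token_counter` are hypothetical names. `tiktoken.encoding_for_model()` raises `KeyError` for model names it does not recognise, which is what triggers the `cl100k_base` fallback.

```python
def count_words(text: str) -> int:
    """Whitespace-separated word count, used when tiktoken is unavailable."""
    return len(text.split())


def make_token_counter(tokenizer: str = "gpt-4"):
    """Return a function mapping text to a token count, degrading gracefully."""
    try:
        import tiktoken
    except ImportError:
        # tiktoken is not installed: approximate tokens with words.
        return count_words

    try:
        encoding = tiktoken.encoding_for_model(tokenizer)
    except KeyError:
        # Unrecognised model name: fall back to the cl100k_base encoding.
        encoding = tiktoken.get_encoding("cl100k_base")
    return lambda text: len(encoding.encode(text))


# The word-based fallback on its own:
print(count_words("Steve Jobs founded Apple Inc. in 1976."))
```

Calling `make_token_counter("my-custom-model")` would follow the `KeyError` branch when tiktoken is installed. Word counts are usually lower than token counts, so chunks sized with the word fallback may exceed a model's real token budget.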
Pipeline Integration
TextSplitter can be used standalone or composed manually with other Semantica modules. A typical sequential pattern is to parse a file, split the text, then extract entities from each chunk.
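
The shape of that parse, split, extract sequence is sketched below with plain-Python stand-ins. None of these functions are Semantica APIs: `parse`, `split`, and `extract_entities` are hypothetical placeholders for the parser, `TextSplitter`, and the entity extractor, and the input is an in-memory string rather than a file so the example runs on its own.

```python
import re


def parse(raw: str) -> str:
    """Stand-in parser: normalise line endings and trim whitespace."""
    return raw.replace("\r\n", "\n").strip()


def split(text: str) -> list[str]:
    """Stand-in splitter: one chunk per non-empty paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def extract_entities(chunk: str) -> list[str]:
    """Stand-in extractor: runs of capitalised words."""
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", chunk)


raw = "Steve Jobs founded Apple.\r\n\r\nApple is based in Cupertino.\r\n"

# Each stage consumes the previous stage's output.
results = [(chunk, extract_entities(chunk)) for chunk in split(parse(raw))]
for chunk, entities in results:
    print(repr(chunk), "->", entities)
```

Keeping each stage a separate function makes it straightforward to swap the splitting step for a different method without touching parsing or extraction.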
Parse
Parse documents before chunking: produces sections and metadata.
Embeddings
Embed chunks for vector search and semantic chunking.
Semantic Extract
Extract entities and relations from individual chunks.
Pipeline
Integrate splitting as a named pipeline step.
