semantica.split breaks documents into chunks that preserve semantic context:
  • Chunking strategies include recursive, semantic, entity-aware, relation-aware, sliding-window, and structural
  • SemanticChunker uses embedding-based topic-shift detection to split only when content changes
  • EntityAwareChunker keeps each entity mention whole, so no mention is split across a chunk boundary
  • RelationAwareChunker keeps subject-predicate-object triplets within a single chunk
  • Chunking quality directly determines downstream embedding accuracy and entity extraction quality

Why Chunking Matters

Most LLMs and embedding models have fixed context windows. Documents larger than that window must be split. But naive splitting (every 500 characters, regardless of structure) destroys semantic context:
  • An entity mention like “Apple Inc.” split across two chunks loses its context in both
  • A relation triplet like “Steve Jobs founded Apple” split at “Steve Jobs” leaves a dangling subject
  • Embedding a chunk that mixes two unrelated topics produces a centroid vector that matches neither
Semantica’s chunking methods are designed to avoid these failure modes.
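
The first failure mode is easy to reproduce without any library. The snippet below is a minimal illustration in plain Python, not part of Semantica; `naive_split` is a hypothetical helper that cuts every `size` characters.

```python
def naive_split(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring words and sentences."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Steve Jobs founded Apple Inc. in 1976."
chunks = naive_split(text, size=22)

for chunk in chunks:
    print(repr(chunk))

# Check whether the entity survives whole in at least one chunk.
entity_intact = any("Apple Inc." in chunk for chunk in chunks)
print("entity intact:", entity_intact)
```

The cut lands inside "Apple", so neither chunk contains the full entity mention.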

Exported Classes

Class | Role
--- | ---
TextSplitter | Unified entry point: swap method= without changing downstream code
Chunk | {text, start_index, end_index, metadata, id}
SemanticChunker | Embedding-based topic-shift detection: splits only when content actually changes
StructuralChunker | Heading/section-based splits using structural text analysis
EntityAwareChunker | Prevents named entity mentions from being split across chunk boundaries
RelationAwareChunker | Keeps subject-predicate-object triplets intact within a single chunk
HierarchicalChunker | Multi-level chunking producing parent/child chunk relationships

Available method= values for TextSplitter:

Method | Best for
--- | ---
recursive | General text: splits on paragraphs, sentences, words in order
sentence | Conversational text, QA
paragraph | Long-form text where paragraph integrity matters
token | LLM context window enforcement
semantic_transformer | Long documents with topic shifts
entity_aware | KG extraction pipelines
relation_aware | KG pipelines where triplet integrity matters
structural | Text with heading/paragraph structure
sliding_window | Dense overlap for bi-encoder retrieval

What You Get

TextSplitter

Unified interface for every chunking strategy: swap methods without changing downstream code.

Semantic Chunking

Embedding-based topic shift detection: splits only when the topic actually changes.
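
A minimal sketch of the idea, not Semantica's implementation. It assumes a boundary is placed wherever the cosine similarity between adjacent sentences falls below a threshold, and it uses word counts as a toy stand-in for real sentence embeddings; `embed`, `cosine`, and `semantic_split` are hypothetical names.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_split(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new chunk where adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])       # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)     # same topic: keep accumulating
    return chunks


sentences = [
    "The chunker splits the report into chunks.",
    "Each chunk of the report keeps the source offsets.",
    "Bananas grow in tropical climates.",
    "Tropical climates also suit mango trees.",
]
for chunk in semantic_split(sentences, threshold=0.2):
    print(chunk)
```

With `threshold=0.2` the only boundary falls between the second and third sentences, which share no words. A real embedding model would also capture similarity between sentences that share meaning but not vocabulary.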

Entity-Aware Chunking

Entity spans never cross chunk boundaries: guaranteed by boundary adjustment.
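
Boundary adjustment can be pictured as follows. This is a sketch under stated assumptions rather than the library's code: entity spans are supplied by hand as half-open `(start, end)` character offsets (in practice they would come from an NER step), and a cut that lands inside a span is pushed to the end of that span. `adjust_boundary` and `entity_aware_split` are hypothetical names.

```python
def adjust_boundary(cut: int, entity_spans: list[tuple[int, int]]) -> int:
    """Move a proposed cut to the end of any entity span it falls inside."""
    for start, end in entity_spans:
        if start < cut < end:
            return end
    return cut


def entity_aware_split(text: str, size: int,
                       entity_spans: list[tuple[int, int]]) -> list[str]:
    """Fixed-size splitting whose cuts never land inside an entity span."""
    chunks, start = [], 0
    while start < len(text):
        cut = min(start + size, len(text))
        cut = adjust_boundary(cut, entity_spans)   # only ever moves forward
        chunks.append(text[start:cut])
        start = cut
    return chunks


text = "Steve Jobs founded Apple Inc. in 1976."
entity_spans = [(0, 10), (19, 29)]   # "Steve Jobs", "Apple Inc."

chunks = entity_aware_split(text, size=22, entity_spans=entity_spans)
for chunk in chunks:
    print(repr(chunk))
```

The trade-off in this sketch is that a chunk may grow past `size` when a cut is moved forward to keep an entity whole.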

Relation-Aware Chunking

Subject–predicate–object triplets kept within a single chunk for KG pipelines.

Chunk Object

Output dataclass with text, character offsets, optional id, and method-specific metadata.

Quick Start

1. Choose a splitting method

from semantica.split import TextSplitter

splitter = TextSplitter(
    method="recursive",   # see Splitting Methods table
    chunk_size=1000,
    chunk_overlap=200,
)

2. Split raw text

# `text` is the document content as a plain Python str
chunks = splitter.split(text)

for chunk in chunks:
    print(f"  Start: {chunk.start_index}, End: {chunk.end_index}")
    print(f"  Method: {chunk.metadata.get('method')}")
    print(f"  Preview: {chunk.text[:80]}...")

3. Or split a document object

# split_documents() accepts any object with a .text attribute,
# or a plain string: no specific document class required.
class Doc:
    def __init__(self, text, metadata=None):
        self.text = text
        self.metadata = metadata or {}

doc = Doc(text="Annual report content...", metadata={"source": "annual_report.pdf"})

splitter = TextSplitter(method="structural")
chunks   = splitter.split_documents([doc])

for chunk in chunks:
    print(f"  {chunk.text[:80]}...")

4. Batch-split a list of documents

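# docs can mix objects that expose .text with plain strings
docs = [doc, "A plain string is accepted as well."]

# split_documents() returns a flat List[Chunk] across all inputs
all_chunks = splitter.split_documents(docs)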

for chunk in all_chunks:
    print(chunk.text[:80])

Splitting Methods

Method | How It Splits | Best For
--- | --- | ---
recursive | Paragraph → sentence → word (cascading fallback) | General-purpose default
semantic_transformer | Embeds sentences, splits at cosine similarity drops | RAG: topic coherence matters
entity_aware | Adjusts boundaries so entity spans are never cut | NER pipelines
relation_aware | Keeps subject–predicate–object triplets within one chunk | KG construction
sentence | Sentence boundary detection (regex, NLTK, spaCy) | Short documents, Q&A
paragraph | Paragraph boundary splitting | Long-form articles, reports
token | Token count via tiktoken or transformers; hard cutoff | LLM context window prep
word | Word count with overlap | Simple token-approximate splits
character | Fixed character count with overlap; fastest, no NLP | Simple batch jobs
sliding_window | Fixed-size window advancing by stride; configurable overlap | Dense retrieval (ColBERT, DPR)
structural | Heading/paragraph structure detection | Text with explicit heading hierarchy
embedding_semantic | Embedding similarity boundaries (alias of semantic_transformer) | RAG with embedding-based coherence
hierarchical | Multi-level section → paragraph → sentence chunking | Multi-granularity retrieval

Choosing a Strategy

Use this decision tree before picking a method:
  • Building a KG? → relation_aware (keeps triplets intact); entity_aware for pure NER
  • RAG system where retrieval quality matters most? → semantic_transformer
  • Dense overlap for bi-encoder retrieval (ColBERT, DPR)? → sliding_window
  • Preparing prompts for a fixed-window LLM? → token
  • Structured text with headings? → structural
  • Paragraph-level coherence? → paragraph or sentence
  • Fast splitting with no NLP overhead? → recursive or character

TextSplitter Constructor

from semantica.split import TextSplitter

splitter = TextSplitter(
    method="semantic_transformer",   # chunking strategy: see Splitting Methods table
    chunk_size=1000,                 # target size in characters
    chunk_overlap=200,               # character overlap between adjacent chunks
    similarity_threshold=0.7,        # cosine similarity cutoff (semantic_transformer only)
    model="all-MiniLM-L6-v2",        # sentence-transformers model name (semantic_transformer only)
    ner_method="ml",                 # NER method (entity_aware only)
    relation_method="ml",            # relation extraction method (relation_aware only)
)
Parameter | Type | Default | Description
--- | --- | --- | ---
method | str or list[str] | "recursive" | Chunking strategy, or a list of methods used as a fallback chain
chunk_size | int | 1000 | Target size in characters, not tokens. If you previously sized chunks in tokens, multiply by ~4 to approximate the same boundary
chunk_overlap | int | 200 | Character overlap between adjacent chunks
similarity_threshold | float | 0.7 | Cosine similarity cutoff for semantic_transformer. Assuming a boundary is placed where adjacent-sentence similarity falls below the cutoff, a higher value produces more splits
model | str | "all-MiniLM-L6-v2" | Sentence-transformers model name for semantic_transformer
ner_method | str | "ml" | NER method for entity_aware: "pattern", "regex", "ml", "huggingface", or "llm"
relation_method | str | "ml" | Relation extraction method for relation_aware: "ml", "llm", or "huggingface"
tokenizer | str | "gpt-4" | tiktoken model name for the token method; unrecognised names fall back to cl100k_base

Pitfall: chunk_overlap too small. Without overlap, a fact that straddles a chunk boundary appears whole in neither chunk. An overlap of 10–20% of chunk_size is a safe minimum: for chunk_size=1000, set chunk_overlap between 100 and 200.
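
The effect of overlap can be checked with a small stand-alone experiment. This is a sketch in plain Python, not Semantica's splitter: `split_with_overlap` is a hypothetical helper that takes fixed-size character windows, each starting `chunk_size - chunk_overlap` characters after the previous one.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):   # this window reached the end
            break
        start += step
    return chunks


text = "Steve Jobs founded Apple Inc. in 1976."
fact = "Apple Inc."

for overlap in (0, 10):
    chunks = split_with_overlap(text, chunk_size=22, chunk_overlap=overlap)
    found = any(fact in chunk for chunk in chunks)
    print(f"overlap={overlap}: fact intact in some chunk: {found}")
```

In this windowing scheme, any span no longer than `chunk_overlap` is guaranteed to fall entirely inside at least one window, which is why the 10-character fact survives with an overlap of 10 but not with an overlap of 0.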

Splitting Method Details

The recursive method tries paragraph breaks first, then sentence boundaries, then word boundaries, dropping to the next level only when a piece still exceeds chunk_size:
splitter = TextSplitter(method="recursive", chunk_size=1000, chunk_overlap=200)
chunks   = splitter.split(text)
Key behaviours:
  • Preserves paragraph and sentence structure wherever possible
  • Falls back gracefully: never produces chunks larger than chunk_size
  • Overlap ensures context continuity across chunk boundaries
  • Good starting point when you’re unsure which method to use
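
The cascading fallback can be sketched in a few lines. This is an illustrative version under simplifying assumptions, not the library's implementation: it ignores overlap, detects sentence ends with a simple regex, and hard-cuts by character as a last resort. `SPLIT_PATTERNS` and `recursive_split` are hypothetical names.

```python
import re

# Coarsest to finest: paragraph break, sentence end, word gap.
# Lookbehinds split *after* each separator, so no characters are lost.
SPLIT_PATTERNS = [r"(?<=\n\n)", r"(?<=[.!?] )", r"(?<= )"]


def recursive_split(text: str, chunk_size: int, patterns=SPLIT_PATTERNS) -> list[str]:
    """Pack pieces at the coarsest level; recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not patterns:
        # Nothing finer to split on: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in re.split(patterns[0], text):
        if len(current) + len(piece) <= chunk_size:
            current += piece                  # piece still fits: keep packing
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, patterns[1:]))
    if current:
        chunks.append(current)
    return chunks


text = (
    "Chunking matters. Naive splits cut entities in half.\n\n"
    "Recursive splitting tries paragraphs first. Then sentences. Then words."
)
chunks = recursive_split(text, chunk_size=64)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits within 64 characters and stays whole; the second does not, so only that paragraph is broken further at sentence boundaries.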

Chunk Schema

from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class Chunk:
    text:        str                    # the chunk's text content
    start_index: int                    # character offset of start in source text
    end_index:   int                    # character offset of end in source text
    metadata:    Dict[str, Any]         # method-specific fields: see table below
    id:          Optional[str] = None   # optional chunk identifier
Metadata keys vary by method. Only keys that are actually set by the implementation are listed.
Field | Type | Set by | Description
--- | --- | --- | ---
method | str | all methods | Splitting method that produced this chunk
chunk_size | int | most methods | Character length of this chunk
sentence_count | int | sentence, semantic_transformer, spaCy path | Number of sentences in this chunk
paragraph_count | int | paragraph | Number of paragraphs in this chunk
word_count | int | word | Number of words in this chunk
token_count | int | token; sentence/semantic_transformer when spaCy is available | Token count: not always present
entity_count | int | entity_aware | Number of entities whose boundaries fall in this chunk
entities | list | entity_aware | Entity objects whose boundaries fall in this chunk
relation_count | int | relation_aware | Number of relation triplets in this chunk
relationships | list | relation_aware | Relation objects in this chunk
element_count | int | structural | Number of structural elements grouped into this chunk
element_types | list[str] | structural | Types of elements: "heading", "paragraph", "list", etc.
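
Because metadata keys vary by method, reading them with `.get()` is the safer pattern, and the character offsets can be sanity-checked against the source text. The sketch below uses a local stand-in for the dataclass so that it runs without Semantica installed, and it assumes `end_index` is exclusive, as in Python slicing; that convention is an assumption worth verifying against real output. `offsets_are_consistent` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Chunk:  # local stand-in mirroring the schema, not imported from semantica
    text: str
    start_index: int
    end_index: int
    metadata: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None


def offsets_are_consistent(source: str, chunk: Chunk) -> bool:
    """True if the chunk's offsets slice its own text back out of the source.

    Assumes end_index is exclusive, as in Python slicing.
    """
    return source[chunk.start_index:chunk.end_index] == chunk.text


source = "Steve Jobs founded Apple Inc. in 1976."
chunks = [
    Chunk(text=source[0:29], start_index=0, end_index=29,
          metadata={"method": "entity_aware", "entity_count": 2}),
    Chunk(text=source[29:38], start_index=29, end_index=38,
          metadata={"method": "entity_aware", "entity_count": 0}),
]

for chunk in chunks:
    # token_count is not set for every method, so .get() returns None here.
    print(chunk.metadata.get("method"), chunk.metadata.get("token_count"),
          offsets_are_consistent(source, chunk))
```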

Tokenizer Options

The token method accepts a tokenizer= kwarg that is passed to tiktoken.encoding_for_model(). The value should be a tiktoken model name. Unrecognised names fall back to cl100k_base automatically.
Value | Encoding used
--- | ---
"gpt-4" (default) | cl100k_base
"gpt-3.5-turbo" | cl100k_base
"text-embedding-ada-002" | cl100k_base
Any unrecognised string | Falls back to cl100k_base

If tiktoken is not installed, the token method falls back to splitting by whitespace-separated words.

Pitfall: wrong tokenizer. Because an unrecognised model name silently falls back to cl100k_base, a typo in tokenizer= goes unnoticed. Pass a valid tiktoken model name (e.g. "gpt-4", "gpt-3.5-turbo") so that the encoding in use is the one you intended.
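
The fallback chain described above can be reproduced outside the library to see which path a given name takes. This is a sketch of the described behaviour, not Semantica's code: `resolve_encoding` and `count_tokens` are hypothetical helpers, while `tiktoken.encoding_for_model()` and `tiktoken.get_encoding()` are the actual tiktoken calls.

```python
def resolve_encoding(tokenizer: str = "gpt-4"):
    """Return a tiktoken encoding, or None when tiktoken is not installed."""
    try:
        import tiktoken
    except ImportError:
        return None
    try:
        return tiktoken.encoding_for_model(tokenizer)
    except KeyError:
        # Unrecognised model name: fall back explicitly to cl100k_base.
        return tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str, encoding=None) -> int:
    """Count tokens with the encoding, or whitespace-separated words without one."""
    if encoding is None:
        return len(text.split())
    return len(encoding.encode(text))


# Without an encoding, the count is a plain word count.
print(count_tokens("split on whitespace when tiktoken is missing"))
```

Calling `count_tokens(text, resolve_encoding("gpt-4"))` uses tiktoken when it is available. Logging inside the `except KeyError` branch is one way to surface a mistyped model name instead of letting the fallback pass silently.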

Pipeline Integration

TextSplitter can be used standalone or composed manually with other Semantica modules. The example below shows a sequential pattern: parse a file, split the text, then extract entities from each chunk:
from semantica.parse import DocumentParser
from semantica.split import TextSplitter
from semantica.semantic_extract import NERExtractor

# Parse
parser = DocumentParser()
parsed = parser.parse("data/report.pdf")   # returns a dict with "full_text" key

# Split
splitter = TextSplitter(method="semantic_transformer", chunk_size=512)
chunks   = splitter.split(parsed["full_text"])

# Extract from each chunk
ner = NERExtractor(method="ml")

for chunk in chunks:
    entities = ner.extract(chunk.text)
    print(f"  {len(entities)} entities in chunk starting at {chunk.start_index}")
For the full pipeline orchestration API, see the Pipeline reference.

Parse

Parse documents before chunking: produces sections and metadata.

Embeddings

Embed chunks for vector search and semantic chunking.

Semantic Extract

Extract entities and relations from individual chunks.

Pipeline

Integrate splitting as a named pipeline step.