semantica.normalize standardizes raw data before extraction and graph construction:
  • Text cleaning: Unicode NFC/NFKC, whitespace collapse, smart-quote and dash normalization
  • Entity canonicalization: alias resolution and disambiguation via configurable alias maps
  • Date normalization: any format → ISO 8601, including relative dates
  • Number conversion: "$1.2B" → 1200000000.0 with unit and currency handling
  • Language detection and encoding repair for inconsistent source data
All normalizers expose convenience functions (one-liners) and stateful class instances (full control).

Why Normalize Before Extraction

Unstructured data is inconsistent by nature. Without normalization, the same real-world entity appears as dozens of variants in your graph:
  • "Apple Inc.", "Apple Computer Inc.", "APPLE INC.": multiple nodes, one company
  • "Jan 1st, 2020", "01/01/2020", "2020-01-01": three formats, one date
  • "$1.2B", "1,200,000,000", "1.2 billion USD": three strings, one number
  • "Hello World" vs "Hello\u00A0World": a non-breaking space (U+00A0) that looks identical on screen but breaks string matching
Normalization collapses these variants before any extractor, deduplicator, or graph builder sees the data.
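The variant problem above can be reproduced with the standard library alone. This is a minimal sketch that uses `unicodedata` and `str.casefold` as stand-ins; it does not call semantica, and the variable names are illustrative.

```python
import unicodedata

plain = "Hello World"        # ordinary space (U+0020)
nbsp = "Hello\u00a0World"    # non-breaking space (U+00A0)

# The two strings render alike but do not compare equal.
equal_before = plain == nbsp

# NFKC maps the non-breaking space to an ordinary space.
equal_after = unicodedata.normalize("NFKC", nbsp) == plain

# Case variants of one name collapse to a single key under casefold().
names = {"Apple Inc.", "APPLE INC.", "apple inc."}
distinct_keys = {name.casefold() for name in names}

print(equal_before, equal_after, len(distinct_keys))
# → False True 1
```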

Exported Classes

| Class | Role |
|---|---|
| TextNormalizer | Unicode forms (NFC/NFKC), whitespace collapse, smart-quote and dash normalization |
| EntityNormalizer | Alias resolution and entity disambiguation using configurable alias maps |
| DateNormalizer | Parses any date string format to ISO 8601; handles relative dates |
| NumberNormalizer | "$1.2B" → 1200000000.0; unit conversion; currency parsing |
| DataCleaner | Detect and remove duplicates, handle missing values, validate records |
| LanguageDetector | detect(text) → language code str; detect_with_confidence(text) → (code, score) tuple |
| EncodingHandler | detect(bytes) → (encoding, confidence) tuple; convert_to_utf8(bytes) → str |

Getting Started

from semantica.normalize import (
    TextNormalizer,
    DateNormalizer,
    NumberNormalizer,
    LanguageDetector,
    EncodingHandler,
)

# Text: normalize unicode, collapse whitespace, replace smart quotes
normalizer = TextNormalizer()
clean = normalizer.normalize_text("  Hello,  World…  ")
# → "Hello, World..."

# Date
date_norm = DateNormalizer()
date = date_norm.normalize_date("Jan 1st, 2020")
# → "2020-01-01T00:00:00+00:00"

# Number
num_norm = NumberNormalizer()
num = num_norm.normalize_number("$1.2B")
# → 1200000000.0

# Language: returns a language code string
detector = LanguageDetector()
lang = detector.detect("Bonjour le monde")
# → "fr"

# Encoding: detect() returns an (encoding_name, confidence) tuple
raw_bytes = b"caf\xe9"  # sample input: "café" encoded as cp1252
handler = EncodingHandler()
encoding, confidence = handler.detect(raw_bytes)
utf8_text = handler.convert_to_utf8(raw_bytes)

1. EncodingHandler: fix encoding first

Broken bytes corrupt everything downstream. Always run this before anything else.
from semantica.normalize import EncodingHandler

handler = EncodingHandler()
# detect returns (encoding_name, confidence_score)
encoding, confidence = handler.detect(raw_bytes)
# convert_to_utf8 returns a str
utf8_text = handler.convert_to_utf8(raw_bytes)
Run encoding repair before anything else. A single cp1252 byte in an otherwise UTF-8 stream either fails strict decoding or is silently replaced with U+FFFD by a lenient decoder, and the original character is lost. Call handler.convert_to_utf8(raw_bytes) first, before any other normalizer sees the data.
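To see why decoding has to come first, the sketch below uses only Python's built-in codecs, not EncodingHandler. It assumes the source encoding (cp1252) is already known, which is the part a detector would normally supply.

```python
raw_bytes = b"caf\xe9"  # "café" encoded as cp1252; not valid UTF-8

# Strict UTF-8 decoding rejects the stray byte outright.
try:
    raw_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding hides the failure behind U+FFFD (the replacement character).
lenient = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the real source encoding recovers the text.
repaired = raw_bytes.decode("cp1252")

# The reverse mistake (UTF-8 bytes read as cp1252) produces mojibake.
mojibake = "caf\u00e9".encode("utf-8").decode("cp1252")

print(strict_ok, repr(lenient), repaired, mojibake)
# → False 'caf\ufffd' café cafÃ©
```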

2. TextNormalizer: unicode, whitespace, special chars

from semantica.normalize import TextNormalizer

normalizer = TextNormalizer()
# normalize_text takes per-call options, not constructor params
clean_text = normalizer.normalize_text(
    utf8_text,
    unicode_form="NFC",
    case="preserve",
)
Don’t lowercase before NER. normalize_text(text, case="lower") before entity extraction destroys capitalization signals that NER relies on. Apply case normalization only after extraction if needed.
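A toy illustration of the capitalization point: the function below is a hypothetical stand-in for an NER model that simply looks for capitalized tokens. Real extractors are far more sophisticated, but many lean on the same signal.

```python
import re


def capitalized_tokens(text: str) -> list[str]:
    """Toy stand-in for NER: return tokens that start with an uppercase letter."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)


sentence = "Apple hired Tim Cook in Cupertino"

before = capitalized_tokens(sentence)          # capitalization intact
after = capitalized_tokens(sentence.lower())   # lowercased first

print(before)  # → ['Apple', 'Tim', 'Cook', 'Cupertino']
print(after)   # → []
```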

3. EntityNormalizer: canonicalize entity names

from semantica.normalize import EntityNormalizer

# Provide aliases in config so the resolver can map variants
normalizer = EntityNormalizer(alias_map={
    "apple computer inc.": "Apple Inc.",
    "apple computer, inc.": "Apple Inc.",
})
canonical = normalizer.normalize_entity(
    "Apple Computer, Inc.", entity_type="Organization"
)
# → "Apple Inc." (if the alias_map contains it, else title-cased input)
EntityNormalizer has no built-in corporate suffix expansion. There is no automatic mapping of "Apple Computer Inc." → "Apple Inc.". To canonicalize corporate names, provide an explicit alias_map with lowercase keys: EntityNormalizer(alias_map={"apple computer inc.": "Apple Inc."}).
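The lowercase-key convention can be pictured as a plain dictionary lookup. The helper `resolve_alias` below is hypothetical and is not EntityNormalizer's implementation; in particular it returns unknown names unchanged rather than title-casing them.

```python
def resolve_alias(name: str, alias_map: dict[str, str]) -> str:
    """Look up a name by its lowercased, whitespace-collapsed form."""
    key = " ".join(name.split()).lower()
    # Unknown names fall through unchanged (an assumption of this sketch).
    return alias_map.get(key, name)


alias_map = {
    "apple computer inc.": "Apple Inc.",
    "apple computer, inc.": "Apple Inc.",
}

known = resolve_alias("Apple  Computer, Inc.", alias_map)
unknown = resolve_alias("Initech LLC", alias_map)

print(known)    # → Apple Inc.
print(unknown)  # → Initech LLC
```

Because keys are compared in lowercase, an alias_map with mixed-case keys would never match in this sketch, which is why the keys are written in lowercase.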

4. DateNormalizer and NumberNormalizer: parse structured values

from semantica.normalize import DateNormalizer, NumberNormalizer

date_norm = DateNormalizer()
num_norm  = NumberNormalizer()

# format and timezone are passed to normalize_date(), not to the constructor
date = date_norm.normalize_date("Jan 1st, 2020", format="ISO8601", timezone="UTC")
# → "2020-01-01T00:00:00+00:00"

num = num_norm.normalize_number("$1.2B")
# → 1200000000.0
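For intuition about what these two conversions involve, here is a standard-library sketch. `parse_date` and `parse_number` are hypothetical helpers that handle only the single date layout and the compact-suffix number form shown above. They are not the library's parsers.

```python
import re
from datetime import datetime, timezone
from decimal import Decimal

MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_date(text: str) -> str:
    """Parse dates shaped like 'Jan 1st, 2020' into ISO 8601 at UTC midnight."""
    # Drop ordinal suffixes ("1st" becomes "1") so strptime can read the day.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text)
    parsed = datetime.strptime(cleaned, "%b %d, %Y")
    return parsed.replace(tzinfo=timezone.utc).isoformat()


def parse_number(text: str) -> float:
    """Parse values such as '$1.2B' or '1,200,000,000' into a float."""
    match = re.fullmatch(
        r"[$€£]?\s*([\d,]*\.?\d+)\s*([KMBT])?", text.strip(), re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"unrecognized number: {text!r}")
    digits, suffix = match.groups()
    # Decimal avoids binary rounding noise while scaling by the suffix.
    value = Decimal(digits.replace(",", ""))
    if suffix:
        value *= MULTIPLIERS[suffix.upper()]
    return float(value)


print(parse_date("Jan 1st, 2020"))    # → 2020-01-01T00:00:00+00:00
print(parse_number("$1.2B"))          # → 1200000000.0
print(parse_number("1,200,000,000"))  # → 1200000000.0
```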

5. LanguageDetector: detect language on clean text

from semantica.normalize import LanguageDetector

detector = LanguageDetector()

# detect() returns a language code string
lang = detector.detect("Bonjour le monde")
# → "fr"

# detect_with_confidence() returns (code, score) tuple
lang, confidence = detector.detect_with_confidence("Bonjour le monde")
# → ("fr", 0.98)

Convenience Functions

The fastest path is one import and one call:
from semantica.normalize import (
    normalize_text, normalize_entity, normalize_date,
    normalize_number, clean_data, detect_language, handle_encoding,
)

clean  = normalize_text("  Hello,   World  ")
# → "Hello, World"

entity = normalize_entity("John Doe", entity_type="Person")
# → "John Doe" (title-cased; alias resolution requires alias_map)

date   = normalize_date("Jan 1st, 2020")
# → "2020-01-01T00:00:00+00:00"

num    = normalize_number("$1.2B")
# → 1200000000.0

# detect_language returns a language code string by default
lang   = detect_language("Bonjour le monde")
# → "fr"

# handle_encoding with operation="detect" returns (encoding, confidence) tuple
encoding, confidence = handle_encoding(raw_bytes, operation="detect")

# handle_encoding with operation="convert" returns a UTF-8 string
utf8_text = handle_encoding(raw_bytes, operation="convert")

Normalizers

TextNormalizer takes config=None, **kwargs in its constructor. Normalization options are passed per-call to normalize_text():
from semantica.normalize import TextNormalizer

normalizer = TextNormalizer()

# normalize_text options
normalized = normalizer.normalize_text(
    raw_text,
    unicode_form="NFC",       # "NFC" | "NFD" | "NFKC" | "NFKD"
    case="preserve",          # "preserve" | "lower" | "upper" | "title"
    normalize_diacritics=False,
    line_break_type="unix",   # "unix" | "windows"
)

# HTML stripping and text cleaning: separate clean_text() method
cleaned = normalizer.clean_text(html_text, remove_html=True)

# Batch normalization
results = normalizer.process_batch(
    ["  hello  ", "WORLD", "café"],
    unicode_form="NFKC",
    case="lower",
)

# normalize() accepts str or List[Dict] (parsed docs from DocumentParser)
docs = [{"content": "Hello world"}, {"content": "test text"}]
normalized_docs = normalizer.normalize(docs)
| normalize_text() parameter | Type | Default | Description |
|---|---|---|---|
| unicode_form | str | "NFC" | Unicode form: "NFC" / "NFD" / "NFKC" / "NFKD" |
| case | str | "preserve" | "preserve" / "lower" / "upper" / "title" |
| normalize_diacritics | bool | False | Strip diacritical marks |
| line_break_type | str | "unix" | "unix" (\n) or "windows" (\r\n) |
Unicode form guide:

| Form | Use When |
|---|---|
| NFC | Default: best for storage and display |
| NFKC | Search indexing: normalizes ligatures, fullwidth chars, and fractions |
| NFD | Stripping diacritics: split é → e + combining accent, then strip accents |
| NFKD | Same as NFD but also decomposes compatibility characters |
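The differences between the forms are easiest to see on concrete characters. The sketch below uses Python's `unicodedata` directly rather than TextNormalizer; `strip_diacritics` is an illustrative helper.

```python
import unicodedata

word = "caf\u00e9"  # "café" with a precomposed é

# NFC keeps é as one code point; NFD splits it into e + combining accent.
lengths = (
    len(unicodedata.normalize("NFC", word)),
    len(unicodedata.normalize("NFD", word)),
)

# NFKC folds compatibility characters: fi ligature, fullwidth A, one-half.
nfkc = {s: unicodedata.normalize("NFKC", s) for s in ("\ufb01", "\uff21", "\u00bd")}


def strip_diacritics(text: str) -> str:
    """Decompose with NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(lengths)                 # → (4, 5)
print(nfkc["\ufb01"])          # → fi
print(nfkc["\uff21"])          # → A
print(strip_diacritics(word))  # → cafe
```

Note that NFKC turns ½ into the three characters 1, U+2044 (fraction slash), 2, not into an ASCII "1/2".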
Sub-normalizers for fine-grained control:
from semantica.normalize import (
    UnicodeNormalizer, WhitespaceNormalizer, SpecialCharacterProcessor
)

unicode_norm = UnicodeNormalizer()
text = unicode_norm.normalize_unicode("café", form="NFC")

ws_norm = WhitespaceNormalizer()
text    = ws_norm.normalize_whitespace("Hello\t\t World\n\n")
# → "Hello  World\n\n"

processor = SpecialCharacterProcessor()
text      = processor.normalize_punctuation("‘Hello’")
# → "'Hello'"
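As a rough picture of what whitespace and punctuation normalizers do, here is a standard-library sketch. It is not a reproduction of the sub-normalizers above: the helper names are hypothetical and the exact output may differ from the library's (this version collapses each run of spaces and tabs to a single space and leaves newlines alone).

```python
import re

# Map typographic quotes and dashes to their ASCII counterparts.
PUNCT_TABLE = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})


def collapse_horizontal_whitespace(text: str) -> str:
    """Replace each run of spaces and tabs with one space; keep newlines."""
    return re.sub(r"[ \t]+", " ", text)


def ascii_punctuation(text: str) -> str:
    """Translate curly quotes and long dashes to ASCII."""
    return text.translate(PUNCT_TABLE)


spaced = collapse_horizontal_whitespace("Hello\t\t World\n\n")
quoted = ascii_punctuation("\u2018Hello\u2019")

print(repr(spaced))  # → 'Hello World\n\n'
print(quoted)        # → 'Hello'
```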

DataCleaner

DataCleaner cleans structured record sets, which is useful before loading them into a vector store or graph:
from semantica.normalize import DataCleaner, DataValidator, DuplicateDetector

cleaner = DataCleaner()

# clean_data: remove_duplicates, validate, and handle_missing in one pass
cleaned = cleaner.clean_data(
    records,
    remove_duplicates=True,
    validate=True,
    handle_missing=True,
    missing_strategy="remove",  # "remove" | "fill" | "impute"
)

# detect_duplicates: returns List[DuplicateGroup]
groups = cleaner.detect_duplicates(records, threshold=0.9)
for group in groups:
    print(f"Duplicate group: {len(group.records)} records, similarity={group.similarity_score:.2f}")
    print(f"  Canonical: {group.canonical_record}")

# handle_missing_values: standalone missing-value handling
processed = cleaner.handle_missing_values(records, strategy="fill", fill_value="")

# validate_data: validate against a schema dict
result = cleaner.validate_data(
    records,
    schema={"fields": {
        "name":   {"type": str,  "required": True},
        "age":    {"type": int,  "required": False},
        "active": {"type": bool, "required": False},
    }},
)
# ValidationResult has .valid (bool), .errors (list), .warnings (list)
print(f"Valid:    {result.valid}")
print(f"Errors:   {len(result.errors)}")
print(f"Warnings: {len(result.warnings)}")
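To make the schema shape concrete, the sketch below checks records against the same {"fields": {...}} layout using plain Python. `validate_records` is a hypothetical helper that returns a list of error strings; it is not how validate_data is implemented and it does not build a ValidationResult.

```python
def validate_records(records: list[dict], schema: dict) -> list[str]:
    """Return one error string per missing required field or type mismatch."""
    errors = []
    for index, record in enumerate(records):
        for field, rule in schema["fields"].items():
            value = record.get(field)
            if value is None:
                if rule.get("required", False):
                    errors.append(f"record {index}: missing required field '{field}'")
                continue
            if not isinstance(value, rule["type"]):
                expected = rule["type"].__name__
                errors.append(f"record {index}: field '{field}' expected {expected}")
    return errors


schema = {"fields": {
    "name": {"type": str, "required": True},
    "age":  {"type": int, "required": False},
}}
sample = [{"name": "Ada", "age": 36}, {"age": "unknown"}]

errors = validate_records(sample, schema)
for line in errors:
    print(line)
# → record 1: missing required field 'name'
# → record 1: field 'age' expected int
```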

DataCleaner Methods

| Method | Returns | Description |
|---|---|---|
| clean_data(dataset, remove_duplicates, validate, handle_missing, **options) | List[Dict] | Combined cleaning pipeline |
| detect_duplicates(dataset, threshold, key_fields) | List[DuplicateGroup] | Return duplicate groups above similarity threshold |
| validate_data(dataset, schema) | ValidationResult | Validate records against a schema dict |
| handle_missing_values(dataset, strategy) | List[Dict] | Remove, fill, or impute missing values |
DataCleaner.remove_duplicates() does not exist as a standalone method. Use detect_duplicates() to get DuplicateGroup objects, or call clean_data(records, remove_duplicates=True), which returns the records with duplicates removed.
DataCleaner operates on flat records, not graph entities. For entity-level semantic deduplication, use DuplicateDetector from the Deduplication module instead.
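Threshold-based duplicate grouping on flat records can be sketched with `difflib` from the standard library. The similarity measure, the greedy grouping strategy, and the `group_duplicates` helper are all assumptions made for illustration; the source does not say how detect_duplicates scores similarity.

```python
from difflib import SequenceMatcher


def group_duplicates(records: list[dict], key: str, threshold: float = 0.9) -> list[list[dict]]:
    """Greedily group records whose `key` values are similar to a group's first record."""
    groups: list[list[dict]] = []
    for record in records:
        for group in groups:
            anchor = group[0][key].lower()
            score = SequenceMatcher(None, anchor, record[key].lower()).ratio()
            if score >= threshold:
                group.append(record)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([record])
    # Only groups with more than one member are duplicates.
    return [group for group in groups if len(group) > 1]


sample = [{"name": "Apple Inc."}, {"name": "Apple Inc"}, {"name": "Microsoft"}]
duplicate_groups = group_duplicates(sample, key="name", threshold=0.9)

print(duplicate_groups)
# → [[{'name': 'Apple Inc.'}, {'name': 'Apple Inc'}]]
```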

Pipeline Integration

from semantica.pipeline import PipelineBuilder, ExecutionEngine
from semantica.ingest import FileIngestor
from semantica.normalize import TextNormalizer
from semantica.semantic_extract import NERExtractor

ingestor   = FileIngestor()
normalizer = TextNormalizer()
extractor  = NERExtractor(method="ml")

builder = PipelineBuilder()
# TextNormalizer.normalize() accepts both str and List[Dict] from DocumentParser
builder.add_step("ingest",    "file_ingest",    handler=ingestor.ingest_file)
builder.add_step("normalize", "text_normalize", handler=normalizer.normalize)
builder.add_step("extract",   "ner_extract",    handler=extractor.extract)
builder.connect_steps("ingest",    "normalize")
builder.connect_steps("normalize", "extract")

pipeline = builder.build("normalize_pipeline")
result   = ExecutionEngine().execute_pipeline(pipeline, data="data/documents/")
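Stripped of the builder API, a linear pipeline is an ordered list of callables where each step receives the previous step's output. The sketch below shows that idea with standard-library steps only; the step functions and `run_steps` are hypothetical and do not use PipelineBuilder or ExecutionEngine.

```python
import unicodedata
from functools import reduce


def decode_bytes(raw: bytes) -> str:
    # Assumes the source encoding is already known to be cp1252.
    return raw.decode("cp1252")


def normalize_unicode(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


# Order matters: bytes must be decoded before any text step runs.
STEPS = [decode_bytes, normalize_unicode, collapse_spaces]


def run_steps(data, steps):
    """Feed the output of each step into the next one."""
    return reduce(lambda value, step: step(value), steps, data)


result = run_steps(b"  caf\xe9   au  lait ", STEPS)
print(result)  # → café au lait
```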

Custom Normalizers

Register a custom normalizer in the method registry:
from semantica.normalize.registry import method_registry

def my_normalizer(text, **kwargs):
    return text.replace("Inc.", "Incorporated")

method_registry.register("text", "expand_suffixes", my_normalizer)

from semantica.normalize import normalize_text
normalized = normalize_text("Apple Inc.", method="expand_suffixes")
# → "Apple Incorporated"
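Conceptually, a method registry is a two-level mapping from task and method name to a callable. The class below is a minimal sketch of that pattern under that assumption; `MethodRegistry` and its `get` method are hypothetical and may not match the internals of semantica.normalize.registry.

```python
from collections import defaultdict
from typing import Callable


class MethodRegistry:
    """Store callables under a (task, name) pair and look them up later."""

    def __init__(self) -> None:
        self._methods: dict[str, dict[str, Callable]] = defaultdict(dict)

    def register(self, task: str, name: str, func: Callable) -> None:
        self._methods[task][name] = func

    def get(self, task: str, name: str) -> Callable:
        try:
            return self._methods[task][name]
        except KeyError:
            raise KeyError(f"no method {name!r} registered for task {task!r}") from None


def expand_suffixes(text: str, **kwargs) -> str:
    return text.replace("Inc.", "Incorporated")


registry = MethodRegistry()
registry.register("text", "expand_suffixes", expand_suffixes)

normalized = registry.get("text", "expand_suffixes")("Apple Inc.")
print(normalized)  # → Apple Incorporated
```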

Related Modules

  • Parse: parse documents before normalization.
  • Split: chunk normalized text for embedding.
  • Deduplication: resolve duplicate entities after normalization.
  • Pipeline: include normalization as a named pipeline step.