Broken bytes corrupt everything downstream. Always run this before anything else.
    from semantica.normalize import EncodingHandler

    handler = EncodingHandler()

    # detect returns (encoding_name, confidence_score)
    encoding, confidence = handler.detect(raw_bytes)

    # convert_to_utf8 returns a str
    utf8_text = handler.convert_to_utf8(raw_bytes)
Run encoding repair before anything else. A single cp1252 character in a UTF-8 stream silently corrupts the surrounding text. Call handler.convert_to_utf8(raw_bytes) first, before any other normalizer sees the data.
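To make the warning concrete, here is a small stdlib-only illustration (it does not use semantica, and the sample string is invented for the demo). A single cp1252 byte makes a strict UTF-8 decode fail, while a lenient decode quietly swaps the character for U+FFFD.

```python
# A UTF-8 stream containing one stray cp1252 byte (the euro sign is 0x80 in cp1252).
raw_bytes = "price: ".encode("utf-8") + "€".encode("cp1252") + " 10".encode("utf-8")

# Strict decoding rejects the whole stream.
try:
    raw_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding "succeeds" but replaces the bad byte with U+FFFD.
lossy_text = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the real source encoding recovers the original character.
repaired_text = raw_bytes.decode("cp1252")

print(strict_ok)             # False
print(ascii(lossy_text))     # 'price: \ufffd 10'
print(ascii(repaired_text))  # 'price: \u20ac 10'
```

The lossy result is the dangerous case: no exception is raised, so later normalizers receive text that has already lost information.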
2. TextNormalizer: unicode, whitespace, special chars
    from semantica.normalize import TextNormalizer

    normalizer = TextNormalizer()

    # normalize_text takes per-call options, not constructor params
    clean_text = normalizer.normalize_text(
        utf8_text,
        unicode_form="NFC",
        case="preserve",
    )
Don’t lowercase before NER. Calling normalize_text(text, case="lower") before entity extraction destroys the capitalization signals that NER relies on. Apply case normalization only after extraction, if it is needed at all.
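For readers unfamiliar with what NFC normalization and case preservation mean in practice, the following is a stdlib-only sketch. The function basic_normalize is hypothetical and only mirrors the option names used above; it is not how TextNormalizer is implemented.

```python
import unicodedata

def basic_normalize(text, unicode_form="NFC", case="preserve"):
    """Sketch: Unicode normalization, whitespace collapse, optional case change."""
    text = unicodedata.normalize(unicode_form, text)
    text = " ".join(text.split())  # collapse whitespace runs and trim the ends
    if case == "lower":
        text = text.lower()
    elif case == "upper":
        text = text.upper()
    return text

# "e" followed by a combining acute accent, plus messy whitespace
decomposed = "Cafe\u0301  de\tParis "

preserved = basic_normalize(decomposed)
lowered = basic_normalize(decomposed, case="lower")

print(ascii(preserved))  # 'Caf\xe9 de Paris'
print(ascii(lowered))    # 'caf\xe9 de paris'
```

In the lowered variant the capital letters that mark "Café" and "Paris" as proper nouns are gone, which is the signal loss the warning above describes.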
3. EntityNormalizer: canonicalize entity names
    from semantica.normalize import EntityNormalizer

    # Provide aliases in config so the resolver can map variants
    normalizer = EntityNormalizer(alias_map={
        "apple computer inc.": "Apple Inc.",
        "apple computer, inc.": "Apple Inc.",
    })

    canonical = normalizer.normalize_entity(
        "Apple Computer, Inc.", entity_type="Organization"
    )
    # → "Apple Inc." (if the alias_map contains it, else title-cased input)
EntityNormalizer has no built-in corporate suffix expansion. There is no automatic mapping of "Apple Computer Inc." → "Apple Inc.". To canonicalize corporate names, provide an explicit alias_map with lowercase keys: EntityNormalizer(alias_map={"apple computer inc.": "Apple Inc."}).
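Because every variant has to be listed by hand with a lowercase key, it can help to generate the map from a canonical-name-first structure. The helper below is a hypothetical convenience, not part of semantica; it only builds a plain dict.

```python
def build_alias_map(canonical_to_variants):
    """Invert {canonical: [variants]} into a lowercase-keyed alias map."""
    alias_map = {}
    for canonical, variants in canonical_to_variants.items():
        for variant in variants:
            alias_map[variant.strip().lower()] = canonical
    return alias_map

alias_map = build_alias_map({
    "Apple Inc.": ["Apple Computer Inc.", "Apple Computer, Inc."],
})
print(alias_map)
# → {'apple computer inc.': 'Apple Inc.', 'apple computer, inc.': 'Apple Inc.'}
```

The resulting dict has the shape shown in the example above and could be passed as EntityNormalizer(alias_map=alias_map).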
4. DateNormalizer and NumberNormalizer: parse structured values
    from semantica.normalize import DateNormalizer, NumberNormalizer

    date_norm = DateNormalizer()
    num_norm = NumberNormalizer()

    # format and timezone are passed to normalize_date(), not to the constructor
    date = date_norm.normalize_date("Jan 1st, 2020", format="ISO8601", timezone="UTC")
    # → "2020-01-01T00:00:00+00:00"

    num = num_norm.normalize_number("$1.2B")
    # → 1200000000.0
5. LanguageDetector: detect language on clean text
    from semantica.normalize import LanguageDetector

    detector = LanguageDetector()

    # detect() returns a language code string
    lang = detector.detect("Bonjour le monde")
    # → "fr"

    # detect_with_confidence() returns (code, score) tuple
    lang, confidence = detector.detect_with_confidence("Bonjour le monde")
    # → ("fr", 0.98)
EntityNormalizer performs whitespace cleanup, optional alias resolution (which requires an explicit alias_map), and name format normalization.
    from semantica.normalize import EntityNormalizer

    # With alias map: resolves exact matches (lowercase key lookup)
    normalizer = EntityNormalizer(alias_map={
        "apple computer inc.": "Apple Inc.",
        "ms": "Microsoft",
        "ml": "Machine Learning",
    })
    normalizer.normalize_entity("Apple Computer Inc.", entity_type="Organization")
    # → "Apple Inc."

    # Without alias map: only whitespace/format cleanup
    normalizer2 = EntityNormalizer()
    normalizer2.normalize_entity("apple inc", entity_type="Organization")
    # → "apple inc" (no built-in suffix expansion)

    # Person: title-cased
    normalizer2.normalize_entity("john doe", entity_type="Person")
    # → "John Doe"
Key behaviours:
- Alias map uses lowercase key lookup: register aliases in lowercase
- entity_type="Person" activates title() casing on the name
- There is no built-in corporate suffix normalization (Inc → Incorporated etc.): add these mappings to alias_map manually if needed
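The behaviours listed above can be approximated in a few lines of plain Python. This is a sketch under the assumption that the steps run in the order whitespace cleanup, alias lookup, then type-specific casing; the function name is hypothetical and the real EntityNormalizer may differ in details.

```python
def normalize_entity_sketch(name, entity_type=None, alias_map=None):
    """Sketch of: whitespace cleanup, lowercase alias lookup, Person title-casing."""
    cleaned = " ".join(name.split())
    if alias_map:
        canonical = alias_map.get(cleaned.lower())  # keys are expected in lowercase
        if canonical is not None:
            return canonical
    if entity_type == "Person":
        return cleaned.title()
    return cleaned  # no suffix expansion, no other rewriting

aliases = {"apple computer inc.": "Apple Inc."}

print(normalize_entity_sketch("Apple  Computer Inc.", "Organization", aliases))
# → Apple Inc.
print(normalize_entity_sketch("apple inc", "Organization"))
# → apple inc
print(normalize_entity_sketch("john doe", "Person"))
# → John Doe
```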
Sub-normalizers:
    from semantica.normalize import AliasResolver, EntityDisambiguator, NameVariantHandler

    # alias_map keys must be lowercase
    resolver = AliasResolver(alias_map={
        "ml": "Machine Learning",
        "nlp": "Natural Language Processing",
    })
    resolved = resolver.resolve_aliases("ml")
    # → "Machine Learning" or None if not in map

    disambiguator = EntityDisambiguator()
    result = disambiguator.disambiguate(
        "Apple",
        entity_type="Organization",
        context="Steve Jobs founded Apple in Cupertino in 1976",
    )
    # → {"entity_name": "Apple", "entity_type": "Organization", "confidence": 0.8, "candidates": ["Apple"]}

    handler = NameVariantHandler()
    canonical = handler.normalize_name_format("Dr. JOHN P. SMITH Jr.")
    # → "John P. Smith Jr." (removes leading title)
AliasResolver uses lowercase key lookup. Register aliases with lowercase keys even if the canonical form is title-cased. The resolver converts the input to lowercase before lookup.
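The NameVariantHandler example above strips a leading title and fixes the casing. A rough stdlib approximation of that behaviour is sketched below; the title list and the function name are assumptions made for illustration, and names such as "O'NEIL" or "MCDONALD" would need extra rules.

```python
# Assumed list of leading titles; the library's actual list is not documented here.
LEADING_TITLES = {"dr", "mr", "mrs", "ms", "prof"}

def normalize_name_format_sketch(name):
    """Drop leading titles, then capitalize each remaining token."""
    tokens = name.split()
    while tokens and tokens[0].rstrip(".").lower() in LEADING_TITLES:
        tokens.pop(0)
    return " ".join(token.capitalize() for token in tokens)

print(normalize_name_format_sketch("Dr. JOHN P. SMITH Jr."))
# → John P. Smith Jr.
```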
DateNormalizer takes config=None, **kwargs. The format and timezone
options are passed to normalize_date(), not the constructor:
    from semantica.normalize import DateNormalizer

    normalizer = DateNormalizer()

    dates = [
        "January 1st, 2020",
        "01/01/2020",
        "2020-01-01T00:00:00Z",
        "yesterday",
        "3 weeks ago",
    ]
    for d in dates:
        print(normalizer.normalize_date(d, format="ISO8601", timezone="UTC"))
Requires python-dateutil: pip install python-dateutil. Falls back to datetime.fromisoformat() if not installed.

Sub-normalizers:
    from datetime import datetime

    from semantica.normalize import TimeZoneNormalizer, RelativeDateProcessor, TemporalExpressionParser

    # TimeZoneNormalizer takes a datetime object, not a string
    tz_norm = TimeZoneNormalizer()
    dt_naive = datetime(2024, 1, 1, 9, 0)
    utc_dt = tz_norm.convert_to_utc(dt_naive)
    tz_dt = tz_norm.normalize_timezone(dt_naive, target_timezone="America/New_York")

    # RelativeDateProcessor: reference_date is passed to process_relative_expression(),
    # not to the constructor
    processor = RelativeDateProcessor()
    ref = datetime(2025, 1, 15)
    result = processor.process_relative_expression("3 days ago", reference_date=ref)
    # → datetime(2025, 1, 12)

    parser = TemporalExpressionParser()
    result = parser.parse_temporal_expression("from January 2020 to March 2021")
    # → {"date": ..., "time": ..., "range": {"start": ..., "end": ...}, "relative": False}
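Passing reference_date per call matters because a relative expression only has meaning against a fixed anchor. The sketch below shows the idea with the standard library alone; the supported phrases and the function name are assumptions for illustration and cover far less than RelativeDateProcessor presumably does.

```python
import re
from datetime import datetime, timedelta

# Maps a singular unit word to the matching timedelta keyword argument.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}

def process_relative_sketch(expression, reference_date):
    """Resolve a few relative phrases against reference_date; None if unsupported."""
    text = expression.strip().lower()
    if text == "today":
        return reference_date
    if text == "yesterday":
        return reference_date - timedelta(days=1)
    if text == "tomorrow":
        return reference_date + timedelta(days=1)
    match = re.fullmatch(r"(\d+)\s+(minute|hour|day|week)s?\s+ago", text)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return reference_date - timedelta(**{_UNITS[unit]: amount})
    return None

ref = datetime(2025, 1, 15)
print(process_relative_sketch("3 days ago", reference_date=ref))   # 2025-01-12 00:00:00
print(process_relative_sketch("3 weeks ago", reference_date=ref))  # 2024-12-25 00:00:00
```

Keeping the reference date as an explicit argument also makes results reproducible in tests, since nothing depends on the current clock.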
NumberNormalizer converts number strings with units, currencies, and abbreviations to int or float. For example, normalize_number("$1.2B") returns 1200000000.0.
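The parsing rules of NumberNormalizer are not spelled out here, so the following is only a stdlib sketch of the general technique: strip a currency symbol and thousands separators, then apply a magnitude suffix. The function name, the supported symbols, and the K/M/B/T suffix table are assumptions.

```python
import re
from decimal import Decimal

_MULTIPLIERS = {"k": 10**3, "m": 10**6, "b": 10**9, "t": 10**12}
_PATTERN = re.compile(
    r"^\s*[$€£]?\s*(-?[\d,]*\.?\d+)\s*([kmbt])?\s*$", re.IGNORECASE
)

def parse_number_sketch(text):
    """Parse strings such as "$1.2B" or "1,500" into a float or int."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised number: {text!r}")
    digits = match.group(1).replace(",", "")
    suffix = match.group(2)
    value = Decimal(digits)  # Decimal avoids float rounding while scaling
    if suffix:
        value *= _MULTIPLIERS[suffix.lower()]
    # Plain integers stay int; anything with a decimal point or suffix becomes float.
    if "." in digits or suffix:
        return float(value)
    return int(value)

print(parse_number_sketch("$1.2B"))  # 1200000000.0
print(parse_number_sketch("1,500"))  # 1500
print(parse_number_sketch("3.5k"))   # 3500.0
```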
Identify the language of a text string. Requires langdetect: pip install langdetect.
    from semantica.normalize import LanguageDetector

    detector = LanguageDetector()

    # detect() returns a language code string
    lang = detector.detect("Bonjour le monde")
    # → "fr"

    # detect_with_confidence() returns (code, confidence) tuple
    lang, confidence = detector.detect_with_confidence("Bonjour le monde")
    # → ("fr", 0.98)

    # detect_multiple() returns List[(code, confidence)]
    results = detector.detect_multiple("This might be mixed", top_n=3)
    # → [("en", 0.85), ...]

    # Batch: returns List[str]
    codes = detector.detect_batch(["Hello", "Hola", "Bonjour", "Ciao"])

    # Check specific language
    is_english = detector.is_language(text, "en", min_confidence=0.8)
detect() requires at least 10 characters for reliable detection. On shorter text it returns the default_language (default: "en").
LanguageDetector.detect() returns a str, not a dict. Use detect_with_confidence() for a (language_code, confidence) tuple, or detect_multiple() for a List[(code, confidence)].
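The short-text rule described above is easy to overlook, because a default of "en" looks like a real detection. The sketch below reproduces that guard around any detector callable so the behaviour can be seen in isolation. Both detect_or_default and fake_detector are hypothetical stand-ins, and the 10-character cutoff is taken from the note above.

```python
def detect_or_default(text, detect_fn, default_language="en", min_chars=10):
    """Return default_language for short input, otherwise defer to detect_fn."""
    if len(text.strip()) < min_chars:
        return default_language
    return detect_fn(text)

def fake_detector(text):
    """Stand-in for a real detector; always answers French."""
    return "fr"

print(detect_or_default("Oui", fake_detector))               # en
print(detect_or_default("Bonjour le monde", fake_detector))  # fr
```

If short strings are common in a dataset, checking the length before trusting a language code avoids treating the default as evidence.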
Detect and repair character encoding issues. Requires chardet: pip install chardet.
    from semantica.normalize import EncodingHandler

    handler = EncodingHandler()

    # detect() returns (encoding_name, confidence_score) tuple
    encoding, confidence = handler.detect(raw_bytes)
    # → ("windows-1252", 0.73)

    # convert_to_utf8() returns a str
    utf8_text = handler.convert_to_utf8(raw_bytes)
    utf8_text = handler.convert_to_utf8(raw_bytes, source_encoding="cp1252")

    # remove_bom() returns same type as input (str or bytes) with BOM stripped
    clean = handler.remove_bom(text_with_bom)

    # Detect and convert a file on disk
    utf8_content = handler.convert_file_to_utf8("input.txt", output_path="output.txt")
Key behaviours:
- detect() uses chardet internally: accuracy improves with longer input
- convert_to_utf8() auto-detects encoding if source_encoding is not provided, then falls back through latin-1, cp1252, iso-8859-1
- Always run EncodingHandler first: broken bytes cause cascading failures in every downstream normalizer
EncodingHandler.detect() returns a (str, float) tuple, not a dict. Unpack with encoding, confidence = handler.detect(data).
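The fallback behaviour listed above can be sketched with the standard library alone. This version skips chardet detection entirely and simply tries candidates in order, so it is an illustration of the fallback idea rather than the library's code. It tries cp1252 before latin-1 because latin-1 accepts every byte sequence, so any codec placed after it would never run; that ordering is a choice made for this sketch. The function names are hypothetical.

```python
import codecs

def strip_bom(data):
    """Drop a leading UTF-8 BOM; returns the same type it was given (bytes or str)."""
    if isinstance(data, bytes):
        if data.startswith(codecs.BOM_UTF8):
            return data[len(codecs.BOM_UTF8):]
        return data
    return data[1:] if data.startswith("\ufeff") else data

def decode_with_fallback(raw_bytes, source_encoding=None,
                         fallbacks=("utf-8", "cp1252", "latin-1")):
    """Try an explicit encoding first, then each fallback in order."""
    candidates = ([source_encoding] if source_encoding else []) + list(fallbacks)
    payload = strip_bom(raw_bytes)
    for encoding in candidates:
        try:
            return payload.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next candidate
    raise ValueError("no candidate encoding could decode the input")

print(ascii(decode_with_fallback("café".encode("cp1252"))))  # 'caf\xe9'
print(decode_with_fallback(codecs.BOM_UTF8 + b"hello"))       # hello
```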
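| Method | Returns | Description |
| --- | --- | --- |
| detect_duplicates() | DuplicateGroup objects | Return duplicate groups above similarity threshold |
| validate_data(dataset, schema) | ValidationResult | Validate records against a schema dict |
| handle_missing_values(dataset, strategy) | List[Dict] | Remove, fill, or impute missing values |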
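The three missing-value options differ in what they keep. The sketch below shows one plausible reading on a list of flat dicts. The strategy names "remove", "fill", and "impute", the fill_value parameter, and the use of a column mean for imputation are all assumptions for illustration, not the documented values accepted by handle_missing_values().

```python
from statistics import mean

def _is_missing(value):
    return value is None or value == ""

def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def handle_missing_sketch(records, strategy="remove", fill_value=None):
    """Return a new list of dicts with missing values removed, filled, or imputed."""
    if strategy == "remove":
        # Drop every record that has at least one missing field.
        return [r for r in records if not any(_is_missing(v) for v in r.values())]
    if strategy == "fill":
        return [{k: fill_value if _is_missing(v) else v for k, v in r.items()}
                for r in records]
    if strategy == "impute":
        # Replace missing values with the mean of the numeric values in that column.
        means = {}
        for key in {k for r in records for k in r}:
            numbers = [r[key] for r in records if _is_number(r.get(key))]
            if numbers:
                means[key] = mean(numbers)
        return [{k: means.get(k, v) if _is_missing(v) else v for k, v in r.items()}
                for r in records]
    raise ValueError(f"unknown strategy: {strategy!r}")

rows = [{"name": "a", "age": 30}, {"name": "b", "age": None}, {"name": "c", "age": 40}]
print(len(handle_missing_sketch(rows, "remove")))        # 2
print(handle_missing_sketch(rows, "impute")[1]["age"])   # 35
```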
DataCleaner.remove_duplicates() does not exist as a standalone method. Use detect_duplicates() to get DuplicateGroup objects, or call clean_data(records, remove_duplicates=True) to remove them in-place.
DataCleaner operates on flat records, not graph entities. For entity-level semantic deduplication, use DuplicateDetector from the Deduplication module instead.
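To show what grouping flat records above a similarity threshold can look like, here is a sketch based on difflib from the standard library. It compares one string field only and returns plain lists rather than DuplicateGroup objects; the function name, the key parameter, and the 0.9 default are assumptions, and the semantic matching of DuplicateDetector is out of scope.

```python
from difflib import SequenceMatcher

def find_duplicate_groups(records, key, threshold=0.9):
    """Greedily group records whose key field is similar to a group's first member."""
    groups = []
    for record in records:
        value = str(record[key]).strip().lower()
        for group in groups:
            representative = str(group[0][key]).strip().lower()
            if SequenceMatcher(None, value, representative).ratio() >= threshold:
                group.append(record)
                break
        else:
            groups.append([record])  # no similar group found, start a new one
    # Only groups with more than one member are duplicates.
    return [group for group in groups if len(group) > 1]

companies = [
    {"name": "Apple Inc."},
    {"name": "apple inc."},
    {"name": "Microsoft"},
]
duplicates = find_duplicate_groups(companies, key="name")
print(len(duplicates))     # 1
print(len(duplicates[0]))  # 2
```

Comparing each record only against the first member of a group keeps the sketch short, at the cost of results that depend on input order.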