Register sources, build a foundation graph, validate quality, and merge with extracted data.
SeedDataSource
Typed source definition supporting CSV, JSON, SQL, and API with format-specific config.
Foundation Graph
Build a foundation graph from all registered sources in one pass, ready to merge with extracted data.
Merge Strategies
seed_first, extracted_first, and merge with property-level conflict detection.
Validation
Required field checks, ID uniqueness, type consistency, reference integrity, and encoding validation before loading.
Versioning
Track seed data versions across pipeline runs and diff changes between versions.
When to use the Seed Module: Bootstrapping with structured reference data (taxonomies, user lists, product catalogs), loading immutable facts (ISO country codes, standard ontology terms) that extracted data should not override, ensuring test reproducibility with deterministic datasets, and anchoring entity disambiguation with canonical forms.
Register all sources before calling create_foundation_graph(). create_foundation_graph() processes all registered sources in one pass. Registering a source after calling it means that source is silently excluded. Register all sources at the start of your script, then call create_foundation_graph() once.
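The ordering pitfall above can be reproduced with a toy registry. This is a minimal sketch, not the Semantica API: the class ToySeedRegistry and its register() method are hypothetical names invented for illustration, and only the create_foundation_graph() name is borrowed from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ToySeedRegistry:
    """Hypothetical stand-in for a seed manager; not the Semantica API."""

    sources: dict = field(default_factory=dict)

    def register(self, name, records):
        # Store a copy of the records under the source name.
        self.sources[name] = list(records)

    def create_foundation_graph(self):
        # Snapshot every source registered so far, in a single pass.
        entities = [rec for recs in self.sources.values() for rec in recs]
        return {"entities": entities, "relationships": []}


registry = ToySeedRegistry()
registry.register("countries", [{"id": "DE", "name": "Germany"}])
registry.register("companies", [{"id": "c1", "name": "Apple Inc."}])

foundation_kg = registry.create_foundation_graph()

# Registered too late: the graph built above never sees this source,
# and nothing raises an error to say so.
registry.register("products", [{"id": "p1", "name": "iPhone"}])

graph_ids = [entity["id"] for entity in foundation_kg["entities"]]
print(graph_ids)
```

The late source is accepted without complaint, which is why the mistake is easy to miss in a long script.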
# Note: validate_quality expects a graph dict, but load_source returns a list.
# For demonstration, validate the foundation graph instead.
report = manager.validate_quality(foundation_kg)
if not report["valid"]:
    for error in report["errors"]:
        print(f"Error: {error}")
    for warning in report["warnings"]:
        print(f"Warning: {warning}")
else:
    print(f"Validated {report['metrics']['entity_count']} entities: no issues found")
Validate before loading. manager.validate_quality(seed_data) catches missing required fields, type inconsistencies, and duplicate IDs before they corrupt your graph. Running validation after loading means you’ll need to roll back. Validation is fast: always run it first.
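To make these checks concrete, here is a minimal pure-Python sketch of a pre-load validator. It is not how validate_quality is implemented. The function validate_records, its required parameter, and the sample records are hypothetical. Only the report shape (valid, errors, warnings, metrics) mirrors the snippet above.

```python
from collections import Counter


def validate_records(records, required=("id", "name")):
    """Toy pre-load check: required fields, duplicate IDs, mixed value types."""
    errors, warnings = [], []

    # Required field checks: a missing key, None, or "" counts as missing.
    for index, rec in enumerate(records):
        for key in required:
            if rec.get(key) in (None, ""):
                errors.append(f"record {index}: missing required field '{key}'")

    # ID uniqueness across all records that have an ID.
    id_counts = Counter(
        rec["id"] for rec in records if rec.get("id") not in (None, "")
    )
    for rec_id, count in id_counts.items():
        if count > 1:
            errors.append(f"duplicate id '{rec_id}' appears {count} times")

    # Type consistency: flag fields whose values use more than one type.
    types_by_field = {}
    for rec in records:
        for key, value in rec.items():
            if value is not None:
                types_by_field.setdefault(key, set()).add(type(value).__name__)
    for key, names in sorted(types_by_field.items()):
        if len(names) > 1:
            warnings.append(f"field '{key}' mixes types: {sorted(names)}")

    return {
        "valid": not errors,
        "errors": errors,
        "warnings": warnings,
        "metrics": {"entity_count": len(records)},
    }


seed_records = [
    {"id": "c1", "name": "Apple Inc.", "founded": 1976},
    {"id": "c1", "name": "Apple", "founded": "1976"},
    {"id": "c2", "founded": 1975},
]
report = validate_records(seed_records)
for error in report["errors"]:
    print(f"Error: {error}")
for warning in report["warnings"]:
    print(f"Warning: {warning}")
```

Because the check only reads the records, it can run before anything is written to the graph, so a failed report costs nothing to recover from.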
4. Merge with extracted data
from semantica.semantic_extract import NERExtractor

extractor = NERExtractor(method="ml")
new_entities = extractor.extract("Apple Inc. partners with Microsoft Corp.")

# Merge with seed data - note the correct parameter names
final_kg = manager.integrate_with_extracted(
    seed_data=foundation_kg,
    extracted_data={"entities": new_entities, "relationships": []},
    merge_strategy="merge",
)
Load seed data before extracted data. Seed data is your ground truth: normalised, curated, and already de-duplicated. Load it first with create_foundation_graph(), then merge extracted entities on top. Merging in the wrong order lets noisy extracted data overwrite trusted reference values.
Use the merge strategy for production pipelines when both seed and extracted data are valuable.
Use seed_first merge strategy for reference data. When seed data encodes authoritative facts (official company names, canonical taxonomy IDs, employee records), merge_strategy="seed_first" ensures those values win over extracted values. Use merge only when extracted data may be more current than the seed.
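The precedence rule can be illustrated on a single entity's properties. This is a sketch under stated assumptions, not the library's merge logic: merge_properties and find_conflicts are hypothetical helpers, and the sample property values are made up. How the merge strategy resolves a conflict is not specified here, so the sketch only shows the property-level conflict detection such a strategy would rely on.

```python
def find_conflicts(seed, extracted):
    """Return properties present on both sides with different values."""
    shared = sorted(seed.keys() & extracted.keys())
    return {
        key: {"seed": seed[key], "extracted": extracted[key]}
        for key in shared
        if seed[key] != extracted[key]
    }


def merge_properties(seed, extracted, strategy="seed_first"):
    """Combine two property dicts; the strategy decides who wins on shared keys."""
    if strategy == "seed_first":
        return {**extracted, **seed}  # seed values overwrite extracted ones
    if strategy == "extracted_first":
        return {**seed, **extracted}  # extracted values overwrite seed ones
    raise ValueError(f"unsupported strategy in this sketch: {strategy}")


seed_entity = {"id": "c1", "name": "Apple Inc.", "ticker": "AAPL"}
extracted_entity = {"id": "c1", "name": "Apple", "hq": "Cupertino"}

seed_wins = merge_properties(seed_entity, extracted_entity, "seed_first")
extracted_wins = merge_properties(seed_entity, extracted_entity, "extracted_first")
conflicts = find_conflicts(seed_entity, extracted_entity)

print(seed_wins["name"])       # canonical name from the seed is kept
print(extracted_wins["name"])  # extracted spelling replaces it
print(conflicts)
```

In both strategies the non-conflicting properties (ticker, hq) survive. Only the contested name property differs, which is the value a reference dataset is meant to protect.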
Use YAML configuration for production deployments. Hard-coding source paths in Python scripts makes environment-switching (dev → staging → prod) fragile. Declare sources in config.yaml under the seed: key and override paths with SEMANTICA_SEED_DATA_DIR. This way, the same code runs in every environment.
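The environment override can be sketched as a small path resolver. This is only an illustration: resolve_seed_path, the default directory data/seed, and the file name companies.csv are hypothetical, and it assumes SEMANTICA_SEED_DATA_DIR names a base directory that relative source paths are joined to. Check how Semantica itself interprets the variable before relying on this behaviour.

```python
import os
from pathlib import Path


def resolve_seed_path(relative_path, default_dir="data/seed"):
    """Join a seed file name to SEMANTICA_SEED_DATA_DIR, else to a default directory."""
    base_dir = os.environ.get("SEMANTICA_SEED_DATA_DIR", default_dir)
    return Path(base_dir) / relative_path


# Development: the variable is unset, so the default directory is used.
os.environ.pop("SEMANTICA_SEED_DATA_DIR", None)
dev_path = resolve_seed_path("companies.csv")

# Production: the deployment sets the variable; the calling code is unchanged.
os.environ["SEMANTICA_SEED_DATA_DIR"] = "/srv/seed"
prod_path = resolve_seed_path("companies.csv")

print(dev_path)
print(prod_path)
```

The call site is identical in both cases, which is the property that makes switching between dev, staging, and prod safe.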
Ingest
Load unstructured data alongside seed data.
Knowledge Graph
The target graph that seed data populates.
Deduplication
Handle duplicates during seed-extracted merge.
Pipeline
Incorporate seed loading as a named pipeline step.