ProvenanceManager is the central tracker for all lineage data. Every call to track_entity, track_relationship, or track_chunk automatically computes and stores a SHA-256 checksum for tamper detection.
    ProvenanceManager(
        storage=None,       # ProvenanceStorage instance; defaults to InMemoryStorage
        storage_path=None,  # str path; creates SQLiteStorage if provided
    )
If both storage and storage_path are omitted, an InMemoryStorage is used.
InMemoryStorage does not persist across restarts. Pass storage_path="provenance.db" or an explicit SQLiteStorage instance in any environment where the audit trail must survive process exits.
    # get_lineage returns a dict, not a ProvenanceEntry
    lineage = manager.get_lineage("apple_inc")
    print(lineage["entity_id"])         # "apple_inc"
    print(lineage["source_documents"])  # ["annual_report.pdf"]
    print(lineage["first_seen"])        # ISO timestamp string
    print(lineage["last_updated"])      # ISO timestamp string
    print(lineage["entity_count"])      # number of entries in chain
    print(lineage["lineage_chain"])     # list of entry dicts (full history)
    print(lineage["metadata"])          # merged metadata dict

    # trace_lineage returns the raw ProvenanceEntry objects
    entries = manager.trace_lineage("apple_inc")
    for entry in entries:
        print(entry.entity_id, entry.source_document, entry.confidence)

    # get_all_sources returns a list of source dicts
    sources = manager.get_all_sources("apple_inc")
    for s in sources:
        print(s["source"], s["location"], s["confidence"])

    # get_provenance returns the most recent entry as a dict (or None)
    prov = manager.get_provenance("apple_inc")
    if prov:
        print(prov["source_document"])
get_lineage() returns an aggregated dict, not a ProvenanceEntry. Use trace_lineage() to get the raw ProvenanceEntry objects when you need field-level access such as entry.checksum.
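To make the difference between the two return shapes concrete, here is a minimal sketch of how a list of raw entries could be folded into a summary dict with the keys shown above. The `Entry` dataclass and `aggregate_lineage` function are hypothetical stand-ins, not library code, and the merge rule (later entries win on metadata key clashes) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Entry:
    """Hypothetical stand-in for a ProvenanceEntry; not the library class."""
    entity_id: str
    source_document: str
    timestamp: str  # ISO 8601 string
    confidence: float = 1.0
    metadata: dict[str, Any] = field(default_factory=dict)


def aggregate_lineage(entity_id: str, entries: list[Entry]) -> dict[str, Any]:
    """Fold raw entries into a summary dict shaped like the get_lineage() result."""
    # ISO timestamps in the same format sort correctly as plain strings.
    ordered = sorted(entries, key=lambda e: e.timestamp)
    merged: dict[str, Any] = {}
    for e in ordered:
        merged.update(e.metadata)  # assumption: later entries win on key clashes
    # dict.fromkeys removes duplicates while keeping first-seen order.
    sources = list(dict.fromkeys(e.source_document for e in ordered))
    return {
        "entity_id": entity_id,
        "source_documents": sources,
        "first_seen": ordered[0].timestamp if ordered else None,
        "last_updated": ordered[-1].timestamp if ordered else None,
        "entity_count": len(ordered),
        "lineage_chain": [vars(e) for e in ordered],
        "metadata": merged,
    }


entries = [
    Entry("apple_inc", "annual_report.pdf", "2024-01-05T10:00:00", 0.9, {"sector": "tech"}),
    Entry("apple_inc", "annual_report.pdf", "2024-03-01T09:30:00", 0.95, {"ticker": "AAPL"}),
]
lineage = aggregate_lineage("apple_inc", entries)
print(lineage["source_documents"], lineage["entity_count"])
```

The summary is convenient for display, but per-entry attributes such as a checksum only survive inside `lineage_chain`, which is why field-level checks work on the raw entries.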
SQLiteStorage persists to disk. It is suitable for production, audit trails, and regulatory compliance:
    from semantica.provenance import SQLiteStorage, ProvenanceManager

    manager = ProvenanceManager(storage=SQLiteStorage("provenance.db"))

    # Or use the shorthand
    manager = ProvenanceManager(storage_path="provenance.db")
SQLiteStorage creates the database and indexes automatically on first use.
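For readers unfamiliar with the create-on-first-use pattern, the sketch below shows the general idea with Python's standard `sqlite3` module. The table name, column names, and index name are assumptions made for illustration; they are not SQLiteStorage's actual schema.

```python
import sqlite3


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a SQLite file and make sure the table and index exist."""
    conn = sqlite3.connect(path)
    # IF NOT EXISTS makes both statements safe to run on every start-up.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS provenance_entries (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            entity_id TEXT NOT NULL,
            source_document TEXT,
            timestamp TEXT NOT NULL,
            checksum TEXT
        )
        """
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_entries_entity "
        "ON provenance_entries (entity_id)"
    )
    conn.commit()
    return conn


# ":memory:" keeps the demo self-contained; a file path such as
# "provenance.db" would persist across restarts.
conn = open_store(":memory:")
conn.execute(
    "INSERT INTO provenance_entries (entity_id, source_document, timestamp, checksum) "
    "VALUES (?, ?, ?, ?)",
    ("apple_inc", "annual_report.pdf", "2024-01-05T10:00:00", "abc123"),
)
rows = conn.execute(
    "SELECT source_document FROM provenance_entries WHERE entity_id = ?",
    ("apple_inc",),
).fetchall()
print(rows)
```

Indexing on the entity identifier is the natural choice here because lineage lookups filter by entity.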
compute_checksum and verify_checksum are used automatically by track_entity and all other tracking methods. You can also call them directly:
    from semantica.provenance import compute_checksum, verify_checksum

    entry = manager.trace_lineage("apple_inc")[0]

    # Recompute checksum from entry fields
    checksum = compute_checksum(entry)

    # Verify using stored checksum (entry.checksum)
    is_valid = verify_checksum(entry)

    # Or verify against a separately stored expected checksum
    is_valid = verify_checksum(entry, expected_checksum=checksum)

    if not is_valid:
        raise RuntimeError("Provenance record has been tampered with.")
The checksum covers entity_id, entity_type, activity_id, source_document, timestamp, and confidence.
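The following standalone sketch illustrates what a SHA-256 checksum over those six fields can look like and why it detects edits. It is not the library's implementation: the canonical-JSON serialization and the `sketch_checksum` name are assumptions, so digests produced here will not match values stored by ProvenanceManager.

```python
import hashlib
import json

# The six fields named in the documentation as covered by the checksum.
CHECKSUM_FIELDS = (
    "entity_id", "entity_type", "activity_id",
    "source_document", "timestamp", "confidence",
)


def sketch_checksum(record: dict) -> str:
    """SHA-256 over the covered fields, serialized as canonical JSON."""
    covered = {name: record.get(name) for name in CHECKSUM_FIELDS}
    # sort_keys and fixed separators give the same bytes for the same values.
    payload = json.dumps(covered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


record = {
    "entity_id": "apple_inc",
    "entity_type": "organization",
    "activity_id": "extraction_001",
    "source_document": "annual_report.pdf",
    "timestamp": "2024-01-05T10:00:00",
    "confidence": 0.95,
    "source_quote": "not covered by the checksum in this sketch",
}
stored = sketch_checksum(record)

# Editing a covered field changes the digest; editing an uncovered one does not.
tampered = {**record, "confidence": 0.99}
annotated = {**record, "source_quote": "edited"}
print(sketch_checksum(tampered) == stored)   # False
print(sketch_checksum(annotated) == stored)  # True
```

If the field list above is exhaustive, changes to other attributes of an entry would not be caught by the checksum alone.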
Run verify_checksum(entry) before any compliance export. Pass the ProvenanceEntry object returned by trace_lineage() directly. If the stored checksum no longer matches, raise an error before the export proceeds.
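One way to enforce this is a small guard that checks every entry before any export code runs. The sketch below is a hypothetical helper, not part of the library: it takes the verifier as a parameter, so in real use you would pass `verify_checksum` and the entries from `trace_lineage()`. The demo uses stub dicts so it runs on its own.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def entries_safe_to_export(entries: Iterable[T], verify: Callable[[T], bool]) -> list[T]:
    """Return all entries, or raise before anything is exported if one fails."""
    checked = list(entries)
    failed = [i for i, entry in enumerate(checked) if not verify(entry)]
    if failed:
        raise RuntimeError(
            f"{len(failed)} provenance record(s) failed verification "
            f"at positions {failed}"
        )
    return checked


# Stub data: the "ok" flag stands in for the result of a real checksum check.
demo = [{"id": "e1", "ok": True}, {"id": "e2", "ok": False}]
try:
    entries_safe_to_export(demo, verify=lambda e: e["ok"])
except RuntimeError as exc:
    print(exc)
```

Checking the full list first, rather than verifying entry by entry while writing, avoids leaving a partially written export behind.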
BridgeAxiom and TranslationChain are available in semantica.provenance.bridge_axiom for tracking multi-layer domain translations with full coefficient attribution:
    from semantica.provenance.bridge_axiom import BridgeAxiom, create_translation_chain
    from semantica.provenance import ProvenanceManager

    manager = ProvenanceManager()

    # Define a bridge axiom with DOI-backed coefficient
    axiom = BridgeAxiom(
        axiom_id="BA-001",
        name="biomass_tourism_elasticity",
        rule="1% biomass increase -> 0.346% tourism revenue increase",
        coefficient=0.346,
        source_doi="10.1038/s41586-021-03371-z",
        source_page="Table S4",
        confidence=0.92,
        input_domain="ecological",
        output_domain="financial",
    )

    # Apply to a value with provenance tracking
    result = axiom.apply(
        input_entity="cabo_pulmo_biomass",
        input_value=463.0,
        prov_manager=manager,
    )
    print(result["output_value"])  # 463.0 * 0.346 ≈ 160.198

    # Build a multi-step translation chain
    input_data = {"entity_id": "cabo_pulmo", "value": 463.0, "source": "DOI:10.1371/..."}
    chain = create_translation_chain(input_data, [axiom], prov_manager=manager)
    print(chain.confidence)  # 0.92
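The arithmetic behind a multi-step chain can be sketched in plain Python. The `Step` class and `run_chain` function are hypothetical stand-ins, the second step is made up for illustration, and multiplying confidences is only one possible convention; the library may combine them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in for one axiom: a coefficient plus its confidence."""
    name: str
    coefficient: float
    confidence: float


def run_chain(value: float, steps: list[Step]) -> tuple[float, float, list[dict]]:
    """Apply each step in order and record which coefficient produced which value."""
    confidence = 1.0
    trail = []
    for step in steps:
        output = value * step.coefficient
        confidence *= step.confidence  # assumption: confidences multiply
        trail.append({
            "step": step.name,
            "input": value,
            "coefficient": step.coefficient,
            "output": output,
        })
        value = output
    return value, confidence, trail


steps = [
    Step("biomass_tourism_elasticity", 0.346, 0.92),  # values from the example above
    Step("hypothetical_second_step", 2.0, 0.80),      # made up for illustration
]
final_value, chain_confidence, trail = run_chain(463.0, steps)
print(round(trail[0]["output"], 3))  # 160.198
print(round(chain_confidence, 3))    # 0.736
```

Keeping the per-step trail is what makes coefficient attribution possible: each intermediate value can be traced back to the step and coefficient that produced it.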
NERExtractor and other extractors accept provenance=True to embed provenance metadata on each extracted entity. You must track the results manually using ProvenanceManager:
    from semantica.semantic_extract import NERExtractor
    from semantica.provenance import ProvenanceManager

    manager = ProvenanceManager()
    ner = NERExtractor(method="ml", provenance=True)
    entities = ner.extract("Steve Jobs founded Apple Inc.")

    # Track each extracted entity manually
    for entity in entities:
        manager.track_entity(
            entity_id=entity.id,
            source="source_document.txt",
            confidence=entity.confidence,
            entity_type=entity.type,
        )

    # Now retrieve lineage
    lineage = manager.get_lineage(entities[0].id)
    print(lineage["source_documents"])
Setting provenance=True on NERExtractor embeds metadata on the extracted entity objects — it does not automatically call ProvenanceManager.track_entity(). You must call track_entity() yourself after extraction.
    from semantica.provenance import ProvenanceManager

    manager = ProvenanceManager(storage_path="provenance.db")

    # Track an entity extracted from a document
    entry = manager.track_entity(
        entity_id="entity_001",
        source="report_2024.pdf",
        source_location="Page 5",
        source_quote="Revenue grew 12% year-over-year.",
        confidence=0.95,
    )
    # entry.checksum is set automatically

    # Retrieve full lineage
    lineage = manager.get_lineage("entity_001")
    print(lineage["source_documents"])
    from semantica.provenance import ProvenanceManager

    manager = ProvenanceManager()

    # Track chunks produced by the split module
    manager.track_chunk(
        chunk_id="chunk_0001",
        source_document="report.pdf",
        start_index=0,
        end_index=512,
    )

    # Batch-track all chunks at once
    chunks = [
        {"id": "c0", "start_index": 0, "end_index": 512},
        {"id": "c1", "start_index": 512, "end_index": 1024},
    ]
    count = manager.track_chunks_batch(chunks, source_document="report.pdf")
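If your chunks come from a simple fixed-size splitter, the list of dicts passed to `track_chunks_batch` can be generated rather than written by hand. The helper below is a hypothetical sketch; it assumes `end_index` is exclusive, as the 0-512 and 512-1024 ranges in the example suggest.

```python
def make_chunks(text: str, size: int = 512) -> list[dict]:
    """Split text into fixed-size windows and describe each as a chunk dict."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [
        {
            "id": f"c{n}",
            "start_index": start,
            # The last window is clipped to the end of the text.
            "end_index": min(start + size, len(text)),
        }
        for n, start in enumerate(range(0, len(text), size))
    ]


text = "x" * 1300  # placeholder document body
chunks = make_chunks(text)
print(chunks)
```

The resulting list has the same keys as the hand-written `chunks` list above and can be passed to the batch call together with the source document name.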
    from semantica.provenance import (
        ProvenanceManager,
        compute_checksum,
        verify_checksum,
    )

    manager = ProvenanceManager(storage_path="provenance.db")
    manager.track_entity("e1", source="doc.pdf", confidence=0.9)

    entries = manager.trace_lineage("e1")
    entry = entries[0]

    # Verify the stored checksum is still valid
    if not verify_checksum(entry):
        raise RuntimeError("Provenance tampered: " + entry.entity_id)
Tamper-evident checksums; timestamps on every entry
GDPR: Lineage graph supports data erasure impact analysis
FDA 21 CFR Part 11: Electronic record with timestamp, agent_id, activity_id, checksum
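As an illustration of the erasure impact analysis mentioned for GDPR, the sketch below takes a mapping of entity IDs to lineage dicts (shaped like the `get_lineage()` result, using only its `source_documents` key) and reports which entities depend on a document slated for erasure. The `erasure_impact` helper and the sample data are hypothetical.

```python
def erasure_impact(document: str, lineages: dict[str, dict]) -> dict[str, list[str]]:
    """Group entities that cite `document` by whether it is their only source."""
    impact: dict[str, list[str]] = {"sole_source": [], "also_sourced_elsewhere": []}
    for entity_id, lineage in lineages.items():
        sources = lineage.get("source_documents", [])
        if document not in sources:
            continue  # entity is unaffected by erasing this document
        key = "sole_source" if set(sources) == {document} else "also_sourced_elsewhere"
        impact[key].append(entity_id)
    return impact


# Hypothetical lineage summaries keyed by entity ID.
lineages = {
    "apple_inc": {"source_documents": ["annual_report.pdf", "press_release.html"]},
    "entity_001": {"source_documents": ["annual_report.pdf"]},
    "entity_002": {"source_documents": ["report_2024.pdf"]},
}
impact = erasure_impact("annual_report.pdf", lineages)
print(impact)
```

Entities in the `sole_source` group would lose all supporting evidence if the document were erased, so they are the ones that need review first.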
ProvenanceManager does not include built-in Turtle or JSON-LD serialization. Use entry.to_dict() and get_lineage() to retrieve provenance data, then serialize with your preferred RDF library if W3C PROV-O RDF output is required.
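A minimal sketch of that hand-off, using only the standard library, maps one provenance dict onto a PROV-O shaped JSON-LD node. The input keys are assumed to match the field names used elsewhere on this page (the exact output of `entry.to_dict()` is not documented here), and the `urn:example:` identifier scheme is a placeholder.

```python
import json
from urllib.parse import quote

PROV_CONTEXT = {
    "prov": "http://www.w3.org/ns/prov#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def entry_to_jsonld(entry: dict, base: str = "urn:example:") -> dict:
    """Map one provenance dict onto a small PROV-O shaped JSON-LD node."""

    def iri(kind: str, value: str) -> str:
        # Percent-encode so names with spaces or slashes stay valid identifiers.
        return f"{base}{kind}/{quote(value, safe='')}"

    node = {
        "@context": PROV_CONTEXT,
        "@id": iri("entity", entry["entity_id"]),
        "@type": "prov:Entity",
    }
    if entry.get("source_document"):
        node["prov:wasDerivedFrom"] = {"@id": iri("source", entry["source_document"])}
    if entry.get("activity_id"):
        node["prov:wasGeneratedBy"] = {"@id": iri("activity", entry["activity_id"])}
    if entry.get("timestamp"):
        node["prov:generatedAtTime"] = {
            "@value": entry["timestamp"],
            "@type": "xsd:dateTime",
        }
    return node


# Hypothetical dict; real code would start from entry.to_dict().
sample = {
    "entity_id": "apple_inc",
    "source_document": "annual report.pdf",
    "activity_id": "extraction_001",
    "timestamp": "2024-01-05T10:00:00",
}
doc = entry_to_jsonld(sample)
print(json.dumps(doc, indent=2))
```

The resulting document can be loaded by an RDF library and re-serialized as Turtle if that is the format your auditors expect.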