ProvenanceManager records a W3C PROV-O compliant entry for every entity, relationship, document chunk, and property value — with a SHA-256 checksum for tamper detection and automatic version chaining on every track_entity() call. Use it when you need to answer regulatory questions about where a value came from, who wrote it, and whether it has changed since first ingestion.
The KG pipeline auto-calls track_entity() and track_relationship() on everything it extracts, so entities that enter through the standard pipeline are already tracked. Use the manual API covered here when you need custom audit integrations, cross-module lineage chains, or fine-grained property-level attribution across multiple sources.

Setting up the provenance store

ProvenanceManager supports two storage backends. In-memory storage is zero-dependency and useful for testing. SQLite storage persists across restarts, supports concurrent reads, and gives your compliance team a standard database they can query directly.
from semantica.provenance import ProvenanceManager

# In-memory — session only, no persistence
prov = ProvenanceManager()

# SQLite: persists to disk, supports concurrent reads
prov = ProvenanceManager(storage_path="provenance.db")

# Custom backend — pass any ProvenanceStorage implementation
from semantica.provenance.storage import SQLiteStorage
prov = ProvenanceManager(storage=SQLiteStorage("audit.db"))
For any regulated deployment — security operations, clinical data, financial risk — use storage_path. A SQLite file can be backed up, versioned, and queried with standard tools without requiring a server.
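Because the store is an ordinary SQLite file, an auditor can inspect it with Python's standard `sqlite3` module. The sketch below makes no assumption about Semantica's table layout: it lists whatever tables and columns the file contains. The `describe_sqlite_file` helper and the `demo_entries` table are hypothetical and exist only so the example runs on its own.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def describe_sqlite_file(db_path):
    """Return {table_name: [column_name, ...]} for every table in the file."""
    with closing(sqlite3.connect(db_path)) as conn:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
        return {
            table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }


# Demo against a throwaway database with a HYPOTHETICAL table,
# not the schema Semantica actually writes.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
with closing(sqlite3.connect(demo_path)) as conn:
    conn.execute("CREATE TABLE demo_entries (entity_id TEXT, checksum TEXT)")
    conn.commit()

print(describe_sqlite_file(demo_path))
# {'demo_entries': ['entity_id', 'checksum']}
```

Pointing the same helper at a real provenance file shows the actual schema, which is the safest starting point before writing ad hoc audit queries.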

Recording provenance when ingesting data

The moment data enters your graph is the moment provenance must be recorded. track_entity() captures the source document, the timestamp, the operator or pipeline that ran the extraction, a verbatim quote from the source, and a confidence score. It returns a ProvenanceEntry with a SHA-256 checksum computed automatically.
# Ingesting CVE-2024-3400 from NVD and a commercial feed
# Both are tracked separately so the full multi-source picture is preserved.

entry_nvd = prov.track_entity(
    entity_id="cve-2024-3400",
    source="NVD_feed_2024-04-12",
    metadata={
        "cvss_score": 10.0,
        "vector": "AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H",
        "exploit_status": "unconfirmed",
    },
    confidence=0.98,
    entity_type="vulnerability",
    activity_id="nvd_feed_ingestion",
    source_location="CVE-2024-3400 JSON record",
    source_quote='{"cvssMetricV31":[{"cvssData":{"baseScore":10.0}}]}',
    agent_id="nvd_ingest_pipeline_v2",
)

print(f"Entity tracked : {entry_nvd.entity_id}")
print(f"Source         : {entry_nvd.source_document}")
print(f"Timestamp      : {entry_nvd.timestamp}")
print(f"Checksum       : {entry_nvd.checksum}")       # SHA-256 hex digest
print(f"First seen     : {entry_nvd.first_seen}")     # set on first call only
Entity tracked : cve-2024-3400
Source         : NVD_feed_2024-04-12
Timestamp      : 2024-04-12T14:22:07.881Z
Checksum       : 3f7a9c2d...                          # tamper-detectable
First seen     : 2024-04-12T14:22:07.881Z
Now track the same entity arriving from the commercial feed an hour later. Calling track_entity() again on the same entity_id automatically archives the NVD entry as a history record and creates a new current entry linked to it via parent_entity_id:
entry_commercial = prov.track_entity(
    entity_id="cve-2024-3400",
    source="commercial_feed_2024-04-12",
    metadata={
        "cvss_score": 9.8,
        "exploit_status": "in_wild",
        "observed_exploitation": True,
    },
    confidence=0.91,
    entity_type="vulnerability",
    activity_id="commercial_feed_ingestion",
    agent_id="threat_ingest_pipeline_v2",
)

# The NVD entry is now archived as cve-2024-3400:v:2024-04-12T14:22:07
# The commercial entry is the new current state
# entry_commercial.parent_entity_id == "cve-2024-3400:v:2024-04-12T14:22:07"
This version chaining happens automatically. You do not need to manage history entries manually.
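To make the archiving behavior concrete, here is a toy in-memory model of version chaining. It is an illustration of the idea described above, not Semantica's implementation: the `VersionedStore` class and its dict layout are hypothetical, and only the `<entity_id>:v:<timestamp>` naming and the `parent_entity_id` link are taken from the example.

```python
from datetime import datetime, timezone


class VersionedStore:
    """Toy store: re-tracking an ID archives the previous current entry."""

    def __init__(self):
        self.entries = {}

    def track(self, entity_id, source, timestamp=None):
        ts = timestamp or datetime.now(timezone.utc)
        parent_id = None
        current = self.entries.get(entity_id)
        if current is not None:
            # Archive the old entry under "<id>:v:<its own timestamp>"
            stamp = current["timestamp"].strftime("%Y-%m-%dT%H:%M:%S")
            parent_id = f"{entity_id}:v:{stamp}"
            self.entries[parent_id] = current
        entry = {
            "entity_id": entity_id,
            "source": source,
            "timestamp": ts,
            "parent_entity_id": parent_id,
        }
        self.entries[entity_id] = entry
        return entry

    def chain(self, entity_id):
        """Follow parent links from the current entry; return oldest first."""
        found = []
        entry = self.entries.get(entity_id)
        while entry is not None:
            found.append(entry)
            entry = self.entries.get(entry["parent_entity_id"])
        return list(reversed(found))


store = VersionedStore()
store.track("cve-2024-3400", "NVD_feed_2024-04-12",
            datetime(2024, 4, 12, 14, 22, 7, tzinfo=timezone.utc))
second = store.track("cve-2024-3400", "commercial_feed_2024-04-12",
                     datetime(2024, 4, 12, 15, 18, 33, tzinfo=timezone.utc))

print(second["parent_entity_id"])
# cve-2024-3400:v:2024-04-12T14:22:07
print([e["source"] for e in store.chain("cve-2024-3400")])
# ['NVD_feed_2024-04-12', 'commercial_feed_2024-04-12']
```

One limitation of this toy version: two updates within the same second would produce the same archive key. A real store needs a finer-grained or otherwise unique version identifier.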

Tracking multi-source property values

When the same property appears in multiple sources with different values — exactly the CVE score situation — use track_property_source() to record each attribution separately. This feeds directly into conflict detection downstream: the conflict module can compare all tracked values for a property and surface disagreements with full source metadata attached.
from semantica.provenance.schemas import SourceReference

nvd_ref = SourceReference(
    document="NVD_feed_2024-04-12",
    section="cvssMetricV31",
    confidence=0.98,
    metadata={"publisher": "NIST", "feed_type": "NVD"},
)

commercial_ref = SourceReference(
    document="commercial_feed_2024-04-12",
    section="cvss_assessment",
    confidence=0.91,
    metadata={"publisher": "ThreatFeed-Co", "observed": True},
)

# Track each source's value separately under the same entity + property key
prov.track_property_source("cve-2024-3400", "cvss_score", 10.0, nvd_ref)
prov.track_property_source("cve-2024-3400", "cvss_score", 9.8,  commercial_ref)

# Property sources are stored under "<entity_id>_<property_name>"
# Later: retrieve all sources for this property to answer "where did 9.8 come from?"
sources = prov.get_all_sources("cve-2024-3400_cvss_score")
for s in sources:
    print(f"{s['source']:<35}  confidence={s['confidence']:.2f}  loc={s['location'] or '—'}")
NVD_feed_2024-04-12                   confidence=0.98  loc=cvssMetricV31
commercial_feed_2024-04-12            confidence=0.91  loc=cvss_assessment
When the regulator asks “where did the 9.8 come from?”, this is the answer: commercial_feed_2024-04-12, section cvss_assessment, confidence 0.91, with full metadata showing it was a commercial publisher that reported observed exploitation.
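The downstream conflict check can be pictured as grouping attributions by reported value and flagging any property with more than one distinct value. The sketch below works on plain dicts and is independent of the conflict module's real API. The `value` key is an assumption: the output above shows only `source`, `confidence`, and `location`, so confirm how the stored value is exposed before adapting this.

```python
from collections import defaultdict


def find_conflicts(attributions):
    """Group attributions by value; return None when all sources agree.

    Otherwise return one summary per distinct value, ordered by the
    highest confidence among the sources reporting it.
    """
    by_value = defaultdict(list)
    for attribution in attributions:
        by_value[attribution["value"]].append(attribution)
    if len(by_value) < 2:
        return None
    summaries = [
        {
            "value": value,
            "sources": [a["source"] for a in group],
            "max_confidence": max(a["confidence"] for a in group),
        }
        for value, group in by_value.items()
    ]
    return sorted(summaries, key=lambda s: s["max_confidence"], reverse=True)


# Values and confidences taken from the example above
attributions = [
    {"source": "NVD_feed_2024-04-12", "value": 10.0, "confidence": 0.98},
    {"source": "commercial_feed_2024-04-12", "value": 9.8, "confidence": 0.91},
]

for summary in find_conflicts(attributions):
    print(summary["value"], summary["sources"], summary["max_confidence"])
# 10.0 ['NVD_feed_2024-04-12'] 0.98
# 9.8 ['commercial_feed_2024-04-12'] 0.91
```

Ordering by confidence is only one possible policy. A deployment might instead prefer the most recent source or a designated authoritative publisher.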

Tracing the lineage of a node

Six months after ingestion, run a lineage trace. get_lineage() returns the full version chain — every state the entity has passed through, oldest to newest — along with summary metadata:
lineage = prov.get_lineage("cve-2024-3400")

print(f"Entity      : {lineage['entity_id']}")
print(f"First seen  : {lineage['first_seen']}")
print(f"Last updated: {lineage['last_updated']}")
print(f"History depth: {lineage['entity_count']} entries")
print(f"Sources seen : {lineage['source_documents']}")
print()
print("Full version chain (oldest → newest):")
for entry in lineage["lineage_chain"]:
    print(f"  [{entry['timestamp'][:19]}]  agent={entry['agent_id']}")
    print(f"    source={entry['source_document']}")
    print(f"    activity={entry['activity_id']}")
Entity      : cve-2024-3400
First seen  : 2024-04-12T14:22:07.881Z
Last updated: 2024-10-08T09:11:44.302Z
History depth: 4 entries
Sources seen : ['NVD_feed_2024-04-12', 'commercial_feed_2024-04-12',
                'NVD_feed_2024-07-18', 'commercial_feed_2024-10-08']

Full version chain (oldest → newest):
  [2024-04-12T14:22:07]  agent=nvd_ingest_pipeline_v2
    source=NVD_feed_2024-04-12
    activity=nvd_feed_ingestion
  [2024-04-12T15:18:33]  agent=threat_ingest_pipeline_v2
    source=commercial_feed_2024-04-12
    activity=commercial_feed_ingestion
  [2024-07-18T08:04:11]  agent=nvd_ingest_pipeline_v2
    source=NVD_feed_2024-07-18
    activity=nvd_feed_ingestion       # NVD updated their score
  [2024-10-08T09:11:44]  agent=threat_ingest_pipeline_v2
    source=commercial_feed_2024-10-08
    activity=commercial_feed_ingestion
The chain answers all three of the regulator’s questions. The 9.8 came from commercial_feed_2024-04-12. The operator was threat_ingest_pipeline_v2. The score has changed — NVD updated their record on July 18 — and the chain shows exactly when.
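To answer "has it changed, and when?" programmatically, one option is to walk the chain oldest to newest and report each point where a field's value differs from the last reported one. The sketch below assumes each chain entry carries a `metadata` dict holding the tracked value; the lineage output above does not show that key, so treat it as hypothetical. The third entry in the demo is made up to show that repeated values are skipped.

```python
_MISSING = object()  # sentinel: no value seen yet


def value_changes(chain, field):
    """List the points in an oldest-to-newest chain where `field` changed."""
    changes = []
    last = _MISSING
    for entry in chain:
        metadata = entry.get("metadata") or {}
        if field not in metadata:
            continue  # this version did not report the field
        value = metadata[field]
        if last is _MISSING or value != last:
            changes.append({
                "timestamp": entry["timestamp"],
                "source_document": entry["source_document"],
                "old": None if last is _MISSING else last,
                "new": value,
            })
        last = value
    return changes


chain = [
    {"timestamp": "2024-04-12T14:22:07", "source_document": "NVD_feed_2024-04-12",
     "metadata": {"cvss_score": 10.0}},
    {"timestamp": "2024-04-12T15:18:33", "source_document": "commercial_feed_2024-04-12",
     "metadata": {"cvss_score": 9.8}},
    # Hypothetical later entry that repeats the previous value
    {"timestamp": "2024-05-01T00:00:00", "source_document": "example_feed",
     "metadata": {"cvss_score": 9.8}},
]

for change in value_changes(chain, "cvss_score"):
    print(change["timestamp"], change["old"], "->", change["new"])
# 2024-04-12T14:22:07 None -> 10.0
# 2024-04-12T15:18:33 10.0 -> 9.8
```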

Verifying integrity

Every ProvenanceEntry carries a SHA-256 checksum computed at write time. If any field is modified after the fact — by a misconfigured pipeline, a database migration, or deliberate tampering — the checksum will not match on recomputation. Run integrity checks as part of any compliance audit:
from semantica.provenance.integrity import compute_checksum

raw_entries = prov.trace_lineage("cve-2024-3400")

print("Integrity check:")
for e in raw_entries:
    stored   = e.checksum
    computed = compute_checksum(e)
    status   = "OK" if stored == computed else "TAMPERED"
    print(f"  [{status}] {e.entity_id[:50]}  {(stored or '')[:16]}...")
Integrity check:
  [OK] cve-2024-3400                                 3f7a9c2d...
  [OK] cve-2024-3400:v:2024-04-12T14:22:07           a1b2c3d4...
  [OK] cve-2024-3400:v:2024-04-12T15:18:33           e5f6a7b8...
  [OK] cve-2024-3400:v:2024-07-18T08:04:11           c9d0e1f2...
A TAMPERED status means the stored hash does not match what would be computed from the current field values — evidence of post-write modification that must be investigated before the record is used for compliance purposes.
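The mechanism behind such a check can be sketched with the standard library alone: serialize the content fields in a canonical form, hash them with SHA-256, and compare against the stored digest. Which fields Semantica hashes and how it serializes them is not specified here, so the `CONTENT_FIELDS` tuple and the JSON canonicalization below are assumptions, and `compute_digest` is a hypothetical stand-in rather than the library's `compute_checksum`.

```python
import hashlib
import json

# ASSUMPTION: the set of hashed fields is illustrative only
CONTENT_FIELDS = ("entity_id", "source_document", "timestamp",
                  "agent_id", "activity_id", "metadata")


def compute_digest(entry):
    """SHA-256 over a canonical JSON rendering of the content fields."""
    payload = {field: entry.get(field) for field in CONTENT_FIELDS}
    # sort_keys and fixed separators make the serialization deterministic
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(entry):
    """True when the stored checksum matches a fresh recomputation."""
    return entry.get("checksum") == compute_digest(entry)


entry = {
    "entity_id": "cve-2024-3400",
    "source_document": "NVD_feed_2024-04-12",
    "timestamp": "2024-04-12T14:22:07.881Z",
    "agent_id": "nvd_ingest_pipeline_v2",
    "activity_id": "nvd_feed_ingestion",
    "metadata": {"cvss_score": 10.0},
}
entry["checksum"] = compute_digest(entry)   # done once, at write time
print(verify(entry))                        # True

entry["metadata"]["cvss_score"] = 7.5       # simulated post-write edit
print(verify(entry))                        # False
```

A plain hash detects accidental or careless modification. It does not stop an attacker who can rewrite both the record and its checksum; that requires keeping the digests somewhere the attacker cannot reach, or signing them.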

Tracking document chunks and their children

Provenance is not just for entities. When a document is split into chunks for RAG or NLP processing, each chunk needs its own provenance record linking it to the source file and byte range. Child chunks (from recursive splitting) link to their parent via parent_chunk_id, which maps to prov:wasDerivedFrom in the W3C model:
# Track the parent chunk (a section of an advisory PDF)
prov.track_chunk(
    chunk_id="advisory_section_3",
    source_document="CISA_advisory_AA24-099A.pdf",
    source_path="/feeds/cisa/advisories/AA24-099A.pdf",
    start_index=4096,
    end_index=8192,
)

# Track a child chunk derived from recursive splitting
prov.track_chunk(
    chunk_id="advisory_section_3a",
    source_document="CISA_advisory_AA24-099A.pdf",
    source_path="/feeds/cisa/advisories/AA24-099A.pdf",
    start_index=4096,
    end_index=6144,
    parent_chunk_id="advisory_section_3",   # prov:wasDerivedFrom
)

# Retrieve the provenance record for a chunk
record = prov.get_provenance("advisory_section_3a")
if record:
    print(f"Source   : {record['source_document']}")
    print(f"Range    : bytes {record['start_index']}-{record['end_index']}")
    print(f"Parent   : {record['parent_entity_id']}")  # advisory_section_3
    print(f"Checksum : {record['checksum']}")
For GDPR right-of-erasure workflows, the byte range in each chunk’s provenance record tells you exactly which part of which document to delete when a data subject makes a deletion request.
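As a rough illustration of that lookup, the helper below finds every chunk of a document whose range overlaps the span to be erased. It operates on plain dicts shaped like the records above. The `chunk_id` key, the `chunks_overlapping` function, and the treatment of `end_index` as exclusive are assumptions to check against your own records.

```python
def chunks_overlapping(records, source_document, start, end):
    """Return IDs of chunks in `source_document` that overlap [start, end)."""
    hits = []
    for record in records:
        if record["source_document"] != source_document:
            continue
        # Two half-open ranges overlap when each starts before the other ends
        if record["start_index"] < end and start < record["end_index"]:
            hits.append(record["chunk_id"])
    return hits


records = [
    {"chunk_id": "advisory_section_3",
     "source_document": "CISA_advisory_AA24-099A.pdf",
     "start_index": 4096, "end_index": 8192},
    {"chunk_id": "advisory_section_3a",
     "source_document": "CISA_advisory_AA24-099A.pdf",
     "start_index": 4096, "end_index": 6144},
]

print(chunks_overlapping(records, "CISA_advisory_AA24-099A.pdf", 5000, 5100))
# ['advisory_section_3', 'advisory_section_3a']
print(chunks_overlapping(records, "CISA_advisory_AA24-099A.pdf", 7000, 7100))
# ['advisory_section_3']
```

Parent and child chunks both appear in the first result, which matters for erasure: removing only the child would leave the same text in the parent.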

Statistics across the provenance store

After a large ingestion run, get_statistics() gives a summary of everything tracked:
stats = prov.get_statistics()

print(f"Total tracked    : {stats['total_entries']}")
print(f"By entity type   : {stats['entity_types']}")
print(f"Unique sources   : {stats['unique_sources']}")
Total tracked    : 14822
By entity type   : {'vulnerability': 3041, 'chunk': 8204, 'relationship': 2891,
                    'property': 686}
Unique sources   : 12
This summary is the starting point for a compliance attestation: you can state the total number of tracked records, the number of distinct data sources, and the breakdown by record type.
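Before putting these numbers in an attestation, a quick consistency check is worthwhile. The sketch below compares the total with the summed breakdown, using the figures from the output above. It assumes every record falls into exactly one `entity_types` bucket, which that output suggests but does not guarantee; `check_breakdown` is a hypothetical helper.

```python
def check_breakdown(stats):
    """Return (ok, difference) between total_entries and the summed types."""
    counted = sum(stats["entity_types"].values())
    return counted == stats["total_entries"], stats["total_entries"] - counted


stats = {
    "total_entries": 14822,
    "entity_types": {"vulnerability": 3041, "chunk": 8204,
                     "relationship": 2891, "property": 686},
    "unique_sources": 12,
}

ok, difference = check_breakdown(stats)
print(ok, difference)
# True 0
```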

Domain examples

A signals intelligence fusion cell tracks custody of every intelligence entity from raw collection through analytic processing to finished product. Each tier of the chain (raw collection, NER extraction, fusion, and finished intelligence) must be recorded separately with the appropriate classification handling and operator identity. The provenance chain is the chain of custody: it proves that a finished intelligence product is traceable to authorized collection and authorized analysis at every step.
Under ITAR and intelligence community sharing agreements, the provenance record must show which collection method produced the raw data, which analyst processed it, and which fusion activity combined it with other intelligence before the entity reached the finished product. track_chunk(), track_entity(), and track_relationship() each correspond to one tier of that chain.
from semantica.provenance import ProvenanceManager
from semantica.provenance.schemas import SourceReference

prov = ProvenanceManager(storage_path="intel_provenance.db")

# Tier 1: Raw collection
prov.track_chunk(
    chunk_id="osint_collection_20260621_0442Z",
    source_document="COLLECTION_TASKING_TK-2026-0192",
    source_path="/osint/raw/20260621_0442Z.txt",
    start_index=0,
    end_index=2048,
    classification="UNCLASSIFIED//FOUO",
    collection_method="OSINT",
    collector_id="STATION_ECHO",
)

# Tier 2: Entity extracted from collection
prov.track_entity(
    entity_id="threat_actor_DELTA9",
    source="osint_collection_20260621_0442Z",
    metadata={"label": "THREAT_ACTOR", "confidence_level": "C2"},
    confidence=0.87,
    entity_type="threat_actor",
    activity_id="ner_extraction",
    source_location="paragraph_3",
    agent_id="analyst_ALPHA",
)

# Tier 3: Campaign relationship from all-source fusion
prov.track_relationship(
    relationship_id="DELTA9_operates_CAMPAIGN_IRON",
    source="FUSION_REPORT_FP-2026-0447",
    metadata={"type": "operates", "confidence": 0.81},
    confidence=0.81,
    activity_id="all_source_fusion",
    agent_id="fusion_cell_BRAVO",
)

# Tier 4: Property from two independent INT sources
humint_src = SourceReference(
    document="HUMINT_REPORT_HR-2026-0821",
    confidence=0.91,
    metadata={"classification": "SECRET", "source_country": "PARTNER_5EYES"},
)
imint_src = SourceReference(
    document="IMINT_PRODUCT_IP-2026-1104",
    page=3,
    section="Ground Truth Assessment",
    confidence=0.87,
    metadata={"sensor": "OPIR", "resolution_m": 0.3},
)
# Use the same entity_id as in Tier 2 so property sources attach to the tracked entity
prov.track_property_source("threat_actor_DELTA9", "location_country", "COUNTRY_X", humint_src)
prov.track_property_source("threat_actor_DELTA9", "location_country", "COUNTRY_X", imint_src)

# Finished product audit — full chain of custody
lineage = prov.get_lineage("threat_actor_DELTA9")
print("Chain of custody:")
for entry in lineage["lineage_chain"]:
    print(f"  [{entry['timestamp'][:19]}]  {entry['activity_id']}  agent={entry['agent_id']}")

# Corroborating INT sources for the location assessment
sources = prov.get_all_sources("threat_actor_DELTA9_location_country")
for s in sources:
    print(f"  INT source: {s['source']}  (conf={s['confidence']:.2f})")

The W3C PROV-O mapping

Every ProvenanceEntry maps directly to W3C PROV-O terms. If your compliance team or a regulator requires a PROV-O export, the field mapping is one-to-one:
| PROV-O term | ProvenanceEntry field | What it records |
| --- | --- | --- |
| prov:Entity | entity_id | The tracked object: entity, chunk, relationship, or property |
| prov:Activity | activity_id | The process that produced it, e.g. "ner_extraction", "bureau_parsing" |
| prov:Agent | agent_id | Who ran the activity: pipeline name, analyst ID |
| prov:wasDerivedFrom | parent_entity_id | The previous version of this entity; enables version chaining |
| prov:used | used_entities | Entity IDs consumed to produce this one |
| prov:generatedAtTime | timestamp | ISO datetime, auto-set to datetime.utcnow() at write time |
The checksum field is not part of the PROV-O standard — it is Semantica’s tamper-detection extension. Every entry’s SHA-256 is computed from its content fields at write time and can be recomputed at any time to verify the record has not been modified.
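If an export is needed, the mapping can be applied by hand. The sketch below turns one entry, given as a plain dict, into Turtle statements using standard PROV-O properties. It is not a Semantica export function: the `urn:example:` namespace and the `to_prov_triples` and `to_turtle` helpers are hypothetical. In PROV-O, `prov:used` links an activity to the entities it consumed, so the sketch attaches `used_entities` to the activity and connects the entity to its activity and agent with `prov:wasGeneratedBy` and `prov:wasAttributedTo`.

```python
from urllib.parse import quote

BASE = "urn:example:"  # hypothetical namespace for local identifiers


def iri(identifier):
    """Wrap a local identifier as an IRI, percent-encoding unsafe characters."""
    return f"<{BASE}{quote(identifier, safe=':-._~')}>"


def to_prov_triples(entry):
    """Map one provenance entry (a dict) to (subject, predicate, object) triples."""
    entity = iri(entry["entity_id"])
    triples = [(entity, "a", "prov:Entity")]
    if entry.get("timestamp"):
        literal = f'"{entry["timestamp"]}"^^xsd:dateTime'
        triples.append((entity, "prov:generatedAtTime", literal))
    if entry.get("parent_entity_id"):
        triples.append((entity, "prov:wasDerivedFrom", iri(entry["parent_entity_id"])))
    if entry.get("activity_id"):
        activity = iri(entry["activity_id"])
        triples.append((activity, "a", "prov:Activity"))
        triples.append((entity, "prov:wasGeneratedBy", activity))
        for used in entry.get("used_entities") or []:
            triples.append((activity, "prov:used", iri(used)))
    if entry.get("agent_id"):
        agent = iri(entry["agent_id"])
        triples.append((agent, "a", "prov:Agent"))
        triples.append((entity, "prov:wasAttributedTo", agent))
    return triples


def to_turtle(triples):
    """Render triples as Turtle text with the prov and xsd prefixes declared."""
    header = [
        "@prefix prov: <http://www.w3.org/ns/prov#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
    ]
    return "\n".join(header + [f"{s} {p} {o} ." for s, p, o in triples])


entry = {
    "entity_id": "cve-2024-3400",
    "timestamp": "2024-04-12T15:18:33Z",
    "parent_entity_id": "cve-2024-3400:v:2024-04-12T14:22:07",
    "activity_id": "commercial_feed_ingestion",
    "agent_id": "threat_ingest_pipeline_v2",
    "used_entities": [],
}
print(to_turtle(to_prov_triples(entry)))
```

For anything beyond a one-off export, a dedicated RDF library is a better fit than string formatting, since it handles escaping and serialization formats for you.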
  • Semantic Extraction — the NER and relation extraction pipeline that auto-generates provenance entries for every extracted entity
  • Conflict Resolution — provenance property sources feed directly into conflict detection; every resolved value is traceable to its source
  • Deduplication — merge operations are recorded in merge history; pair with provenance for a complete lineage from source to canonical entity
  • Provenance Reference — full storage backend API, InMemoryStorage, SQLiteStorage, and ProvenanceEntry schema