semantica.deduplication detects and merges duplicate entities across sources to produce a clean, single-source-of-truth knowledge graph:
  • Four v2 strategies up to 7× faster than v1: blocking_v2, hybrid_v2, semantic_v2
  • ClusterBuilder uses Union-Find and hierarchical clustering for batch deduplication at scale
  • EntityMerger preserves original source provenance on every merged entity
  • MergeStrategyManager supports per-property rules and conflict resolution
  • All workflows operate on plain Python dicts: no ORM or schema required

Exported Classes

ClassRole
DuplicateDetectorPairwise and batch detection: returns DuplicateCandidate or DuplicateGroup lists
EntityMergerMerge duplicate groups: returns List[MergeOperation]
SimilarityCalculatorMulti-factor similarity: string, property, relationship, and embedding
ClusterBuilderUnion-Find and hierarchical clustering for large-scale batch deduplication
MergeStrategyEnum of merge strategies: KEEP_FIRST, KEEP_LAST, KEEP_MOST_COMPLETE, KEEP_HIGHEST_CONFIDENCE, MERGE_ALL
PropertyMergeRuleDataclass holding per-property merge rule: {property_name, strategy, conflict_resolution, priority}
MergeStrategyManagerManage and apply named merge strategies; accepts per-property rules
detect_duplicates()Convenience function: detect_duplicates(entities, method="pairwise", similarity_threshold=0.7)
merge_entities()Convenience function: merge_entities(entities, method="keep_most_complete")
calculate_similarity()Convenience function: calculate_similarity(entity_a, entity_b, method="multi_factor")

What You Get

DuplicateDetector

Pairwise, batch, incremental, and group detection modes. Returns scored candidates with reasons.

EntityMerger

Five merge strategies: keep first, last, most complete, highest confidence, or merge all fields.

SimilarityCalculator

Multi-factor scoring across string edit distance, property overlap, relationship overlap, and embeddings.

ClusterBuilder

Union-Find and hierarchical clustering for batch deduplication at scale: handles 100k+ entity sets.

MergeStrategyManager

Per-property merge rules with conflict resolution priorities. Apply different strategies to different fields.

v2 Strategies

blocking_v2, hybrid_v2, semantic_v2: up to 7× faster than v1 for large entity sets.

Getting Started

from semantica.deduplication import DuplicateDetector, EntityMerger

entities = [
    {"id": "1", "name": "Apple Inc.",   "type": "Company"},
    {"id": "2", "name": "Apple",        "type": "Company"},
    {"id": "3", "name": "Microsoft",    "type": "Company"},
]

# 1. Detect duplicates: returns List[DuplicateCandidate]
detector   = DuplicateDetector(similarity_threshold=0.7)
candidates = detector.detect_duplicates(entities)

for dup in candidates:
    print(
        "{} vs {}: sim: {:.2f}, confidence: {:.2f}".format(
            dup.entity1.get("name"),
            dup.entity2.get("name"),
            dup.similarity_score,
            dup.confidence,
        )
    )

# 2. Merge duplicates: returns List[MergeOperation]
merger     = EntityMerger()
operations = merger.merge_duplicates(entities, strategy="keep_most_complete")

for op in operations:
    print("Merged {} entities → {}".format(
        len(op.source_entities), op.merged_entity.get("name")
    ))
Normalize entity names before deduplication. Canonical forms such as "Apple Inc." vs "apple inc" may score below threshold due to case alone. Run EntityNormalizer or TextNormalizer first for reliable matching.

DuplicateDetector

Find duplicate entity pairs:
from semantica.deduplication import DuplicateDetector

detector = DuplicateDetector(
    similarity_threshold=0.7,    # default 0.7: minimum score to include a candidate
    confidence_threshold=0.6,    # default 0.6: minimum confidence to include a candidate
    max_results=100,             # optional: hard cap on total candidates returned
    top_k_per_entity=3,          # optional: max candidates per entity
    min_similarity=0.75,         # optional: additional floor applied after sorting
    sort_by="confidence",        # "confidence" (default) | "similarity_score"
)

# Returns List[DuplicateCandidate]
candidates = detector.detect_duplicates(entities)

for c in candidates:
    print(c.entity1.get("name"), "vs", c.entity2.get("name"))
    print("  similarity_score:", c.similarity_score)
    print("  confidence:      ", c.confidence)
    print("  reasons:         ", c.reasons)

# Detect duplicate groups with union-find (returns List[DuplicateGroup])
groups = detector.detect_duplicate_groups(entities)
for g in groups:
    print("Group of {}: confidence: {:.2f}: representative: {}".format(
        len(g.entities), g.confidence, g.representative and g.representative.get("name")
    ))

# Incremental: compare new entities against existing (returns List[DuplicateCandidate])
new_entities = [{"id": "4", "name": "Apple Corp.", "type": "Company"}]
candidates   = detector.incremental_detect(new_entities, entities)
Tune similarity_threshold before confidence_threshold. The similarity threshold gates which entity pairs are even considered. The confidence threshold further filters those pairs based on multi-factor scoring. Start with similarity_threshold=0.7 and raise it to reduce false positives.
Use detect_duplicate_groups() when you need to merge. The "group" detection strategy uses union-find to form transitive clusters: if A≈B and B≈C, all three land in the same group. Plain detect_duplicates() returns individual pairs without transitivity.

detect_duplicates() detection methods

The method= parameter of the detect_duplicates() convenience function controls how the comparison is performed. These are independent of the SimilarityCalculator string method used internally:
method=AlgorithmReturns
"pairwise" (default)O(n²) all-pairs comparisonList[DuplicateCandidate]
"batch"Batch similarity calculationList[DuplicateCandidate]
"incremental"New vs existing entitiesList[DuplicateCandidate]
"group"Union-find group formationList[DuplicateGroup]

DuplicateCandidate fields

FieldTypeDescription
entity1DictFirst entity
entity2DictSecond entity
similarity_scorefloatSimilarity score (0–1)
confidencefloatConfidence score (0–1)
reasonsList[str]Why they are considered duplicates
metadataDictAdditional metadata
DuplicateCandidate fields are entity1, entity2, similarity_score: not entity_a, entity_b, similarity. Accessing the wrong field names raises AttributeError.

DuplicateGroup fields

FieldTypeDescription
entitiesList[Dict]All entities in the group
similarity_scoresDictPair → score mapping
representativeOptional[Dict]Most complete entity in group
confidencefloatGroup confidence score
metadataDictAdditional metadata

EntityMerger

Merges detected duplicate groups into canonical entities:
from semantica.deduplication import EntityMerger, MergeStrategy

# Basic usage
merger     = EntityMerger(preserve_provenance=True)
operations = merger.merge_duplicates(entities, strategy="keep_most_complete")

# operations is List[MergeOperation]
for op in operations:
    merged = op.merged_entity         # the resulting merged entity dict
    sources = op.source_entities      # list of original entities that were merged
    conflicts = op.merge_result.conflicts  # unresolved conflicts, if any

# Merge a known group directly (no duplicate detection)
pair = [
    {"id": "1", "name": "Apple Inc.", "type": "Company"},
    {"id": "2", "name": "Apple",      "type": "Company"},
]
op = merger.merge_entity_group(pair, strategy="keep_most_complete")
print(op.merged_entity)

# Retrieve full merge history
history = merger.get_merge_history()
print("Total merges performed:", len(history))
merge_entities() and EntityMerger.merge_duplicates() return List[MergeOperation], not a list of entity dicts. Access .merged_entity on each operation to get the merged dict.

Merge strategies

Pass as a string to strategy= on merge_duplicates() or merge_entity_group():
StrategyBehavior
"keep_first"Keep the first entity in each duplicate group
"keep_last"Keep the most recently seen entity
"keep_most_complete"Keep the entity with the most non-null properties + relationships
"keep_highest_confidence"Keep the entity with the highest .confidence value
"merge_all"Combine all properties: conflicts resolved to lists

Per-property merge rules

Per-property rules are set on EntityMerger.merge_strategy_manager using add_property_rule(). Rules take a MergeStrategy enum value:
from semantica.deduplication import EntityMerger, MergeStrategy

merger = EntityMerger()

# Add per-property rules
merger.merge_strategy_manager.add_property_rule(
    "name", MergeStrategy.KEEP_FIRST
)
merger.merge_strategy_manager.add_property_rule(
    "aliases", MergeStrategy.MERGE_ALL
)

# Custom conflict resolution function
def keep_longest(val1, val2):
    return val1 if len(str(val1)) >= len(str(val2)) else val2

merger.merge_strategy_manager.add_property_rule(
    "description", MergeStrategy.KEEP_FIRST, conflict_resolution=keep_longest
)

operations = merger.merge_duplicates(entities)
PropertyMergeRule is a dataclass, not an Enum. The merge strategy Enum is MergeStrategy (KEEP_FIRST, KEEP_LAST, KEEP_MOST_COMPLETE, KEEP_HIGHEST_CONFIDENCE, MERGE_ALL). Per-property rules are added via merger.merge_strategy_manager.add_property_rule(name, strategy).

MergeOperation fields

FieldTypeDescription
source_entitiesList[Dict]Original entities that were merged
merged_entityDictResulting merged entity
merge_resultMergeResultDetailed result with conflicts
metadataDictGroup confidence, similarity scores, strategy used

SimilarityCalculator

Compute multi-factor similarity scores between entity pairs:
from semantica.deduplication import SimilarityCalculator

calc = SimilarityCalculator(
    string_weight=0.6,        # default 0.6
    property_weight=0.2,      # default 0.2
    relationship_weight=0.2,  # default 0.2
    embedding_weight=0.0,     # default 0.0 (used only when "embedding" key is present)
    similarity_threshold=0.7,
)

result = calc.calculate_similarity(entity_a, entity_b)
# result is a SimilarityResult
print(result.score)                         # overall score 0.0–1.0
print(result.method)                        # e.g. "multi_factor"
print(result.components["string"])          # string similarity component
print(result.components["property"])        # property overlap component
print(result.components["relationship"])    # relationship jaccard component
# result.components["embedding"] is present only when embeddings are supplied

# String similarity methods: method= accepts "levenshtein", "jaro_winkler", "cosine"
lev  = calc.calculate_string_similarity("Apple Inc.", "Apple Inc",  method="levenshtein")
jaro = calc.calculate_string_similarity("Steve Jobs", "Steven Jobs", method="jaro_winkler")
cos  = calc.calculate_string_similarity("apple",     "apples",      method="cosine")

# Embedding cosine similarity (vector inputs)
emb_score = calc.calculate_embedding_similarity(embedding_a, embedding_b)

# Property and relationship similarity (entity dict inputs)
prop_score = calc.calculate_property_similarity(entity_a, entity_b)
rel_score  = calc.calculate_relationship_similarity(entity_a, entity_b)

SimilarityResult fields

FieldTypeDescription
scorefloatOverall weighted similarity score (0–1)
methodstrMethod used (e.g. "multi_factor", "levenshtein")
componentsDict[str, float]Per-component scores: "string", "property", "relationship", "embedding"
metadataDictWeights used and optional score breakdown

ClusterBuilder

Build entity clusters for large-scale batch deduplication:
from semantica.deduplication import ClusterBuilder

builder = ClusterBuilder(
    similarity_threshold=0.8,  # minimum similarity to be in same cluster
    min_cluster_size=2,         # minimum entities per valid cluster
    max_cluster_size=100,       # maximum entities per cluster
    use_hierarchical=False,     # True for hierarchical, False (default) for union-find
)
result = builder.build_clusters(entities)

print("Clusters found:", len(result.clusters))
for cluster in result.clusters:
    print("  [{}] {} entities: quality: {:.2f}".format(
        cluster.cluster_id,
        len(cluster.entities),
        cluster.quality_score,
    ))

print("Unclustered:   ", len(result.unclustered))
print("Quality metrics:", result.quality_metrics)
# {"average_size": ..., "average_quality": ..., "total_clusters": ..., "high_quality_clusters": ...}

Cluster fields

FieldTypeDescription
cluster_idstrUnique cluster identifier
entitiesList[Dict]Entities in the cluster
centroidOptional[Dict]Representative entity (optional)
quality_scorefloatIntra-cluster average similarity
metadataDictSimilarity scores and other metadata

Convenience Functions

from semantica.deduplication import detect_duplicates, merge_entities, calculate_similarity

# Detect: method= accepts "pairwise" (default), "batch", "incremental", "group"
candidates = detect_duplicates(
    entities,
    method="pairwise",
    similarity_threshold=0.8,
    confidence_threshold=0.6,
)

# Merge: method= accepts the strategy strings, same as EntityMerger
operations = merge_entities(entities, method="keep_most_complete", preserve_provenance=True)
# Returns List[MergeOperation]; access .merged_entity on each

# Similarity: method= accepts "exact", "levenshtein", "jaro_winkler", "cosine",
#              "property", "relationship", "embedding", "multi_factor" (default)
result = calculate_similarity(entity_a, entity_b, method="multi_factor")
print(result.score)

Custom Similarity Functions

Register domain-specific similarity logic and use it via the method registry:
from semantica.deduplication import method_registry, SimilarityResult

def drug_name_similarity(entity_a, entity_b, **kwargs):
    """Match drug names by active compound prefix."""
    name_a = entity_a.get("name", "").lower()
    name_b = entity_b.get("name", "").lower()
    score = 1.0 if name_a[:5] == name_b[:5] else 0.0
    return SimilarityResult(score=score, method="drug_name")

method_registry.register("similarity", "drug_name", drug_name_similarity)

# Now callable via calculate_similarity
from semantica.deduplication import calculate_similarity
result = calculate_similarity(entity_a, entity_b, method="drug_name")

Common Workflows

from semantica.deduplication import DuplicateDetector, EntityMerger

detector   = DuplicateDetector(similarity_threshold=0.8)
candidates = detector.detect_duplicates(entities)

print("Duplicate pairs found:", len(candidates))

merger     = EntityMerger(preserve_provenance=True)
operations = merger.merge_duplicates(entities, strategy="keep_most_complete")

merged_entities = [op.merged_entity for op in operations]

Conflicts

Detect value conflicts between non-duplicate entities.

Knowledge Graph

GraphBuilder uses deduplication during construction.

Normalize

Normalize entity names before deduplication.

Provenance

Track merged entity lineage.