semantica.embeddings converts text and graph structures into dense vector representations:
  • Provider-agnostic API: FastEmbed (default, ONNX, no GPU), Sentence-Transformers, OpenAI, BGE
  • Powers semantic search, entity resolution, GraphRAG retrieval, and deduplication
  • GraphEmbeddingManager embeds KG nodes and edges for graph database backends
  • Five pooling strategies: Mean (default), Max, CLS, Attention, Hierarchical
  • check_available_providers() shows which backends are installed in your environment

Why Embeddings Matter

Raw text can’t be compared mathematically. Embeddings translate meaning into geometry: two semantically similar sentences produce vectors that are close together in high-dimensional space, even when they share no words. Semantica uses embeddings for:
  • Semantic search: find knowledge graph nodes by meaning, not just keywords
  • Entity resolution: detect that “Apple Inc.” and “Apple Computer” refer to the same entity
  • Deduplication: semantic_v2 strategy measures entity similarity via embedding distance
  • GraphRAG retrieval: hybrid vector + graph traversal for grounded LLM answers
  • Semantic chunking: detect topic shift boundaries in TextSplitter(method="semantic_transformer")
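To make "meaning as geometry" concrete, here is a minimal sketch of ranking items by cosine similarity. The three-dimensional vectors are invented stand-ins for real embeddings (which have hundreds of dimensions), and `cosine_similarity` is a hypothetical helper written for this example, not part of semantica.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only.
toy_index = {
    "Apple Inc.": [0.9, 0.1, 0.0],
    "Apple Computer": [0.8, 0.2, 0.1],
    "Banana bread recipe": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]

# Rank stored items from most to least similar to the query.
ranked = sorted(
    toy_index,
    key=lambda name: cosine_similarity(query, toy_index[name]),
    reverse=True,
)
print(ranked)
```

The two "Apple" entries point in nearly the same direction as the query and rank first, while the unrelated entry ranks last. Semantic search and entity resolution rely on the same comparison, applied to real model output.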

Exported Classes

| Class | Role |
| --- | --- |
| EmbeddingGenerator | Provider-agnostic entry point: handles batching and provider selection |
| TextEmbedder | Text embedding with automatic batch processing; default uses FastEmbed |
| GraphEmbeddingManager | Embed KG nodes and edges for GraphRAG and graph databases |
| VectorEmbeddingManager | Prepare and format embeddings for vector database backends |
| OpenAIStore | OpenAI text-embedding-3-small / text-embedding-3-large provider |
| BGEStore | BAAI/bge models via sentence-transformers |
| FastEmbedStore | ONNX-accelerated local embeddings: no CUDA required |
| LlamaStore | Placeholder store: not production-ready; do not use for embeddings |
| MeanPooling | Default pooling strategy: best for retrieval and clustering |

What You Get

  • EmbeddingGenerator: main entry point; provider-agnostic, handles batching automatically across all backends.
  • TextEmbedder: text-specific, with automatic batching and progress tracking. Default method is FastEmbed.
  • GraphEmbeddingManager: node and edge embeddings for graph databases (Neo4j, NetworkX, FalkorDB).
  • VectorEmbeddingManager: prepares, normalizes, and formats embeddings for FAISS, Weaviate, Qdrant, and Milvus.
  • Provider Stores: OpenAIStore, BGEStore, FastEmbedStore, and ProviderStoreFactory.
  • Pooling Strategies: Mean, Max, CLS, Attention, and Hierarchical; these control token-to-vector aggregation.

Provider Setup

FastEmbed: ONNX-accelerated local embeddings. No GPU or API key required. The best starting point.
pip install "semantica[fastembed]"
from semantica.embeddings import EmbeddingGenerator

# FastEmbed is the default: no config needed
generator = EmbeddingGenerator()
embedding = generator.generate_embeddings("Text about AI")
Default model is BAAI/bge-small-en-v1.5. Zero cost, zero GPU, works on any machine.
Note: FastEmbed ignores the device parameter. It uses ONNX Runtime and manages its own execution providers, so passing device="cuda" has no effect. Switch to method="sentence_transformers" if you need explicit GPU control.
Check which providers are installed in your environment:
from semantica.embeddings import check_available_providers

providers = check_available_providers()
# → {"sentence_transformers": True, "fastembed": True, "openai": False}
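For intuition, an availability check like this can be built from the standard library alone. The sketch below is an assumption about how such a check might work, not semantica's actual implementation; `find_installed` is a hypothetical helper.

```python
import importlib.util

def find_installed(packages):
    """Map each import name to True if it can be located, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

# Import names of the optional backends mentioned above.
status = find_installed(["sentence_transformers", "fastembed", "openai"])
print(status)
```

Because `find_spec` only locates a top-level package instead of importing it, the check stays cheap even for heavy dependencies. The printed values depend on what is installed in your environment.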

Getting Started

EmbeddingGenerator is the fastest path to embeddings. Its default method is FastEmbed (ONNX, no GPU needed):
from semantica.embeddings import EmbeddingGenerator

# Default: FastEmbed with BAAI/bge-small-en-v1.5
generator = EmbeddingGenerator()

# Embed a single text
embedding = generator.generate_embeddings("Text about AI")

# Embed a batch
embeddings = generator.generate_embeddings(["Text about AI", "Machine learning concepts"])

# Compare two embeddings (cosine similarity: 1.0 means identical direction; usually positive for text, though the mathematical range is -1.0 to 1.0)
score = generator.compare_embeddings(embeddings[0], embeddings[1], method="cosine")
print(f"Similarity: {score:.3f}")
Always use the same model for indexing and querying. Vectors from different models are not comparable: they live in different vector spaces. Switching models requires re-embedding your entire corpus.
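One way to enforce this rule is to store the model identity next to the index and compare it at query time. The sketch below is illustrative only: `EmbeddingSpace` and `assert_same_space` are hypothetical names, not semantica APIs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpace:
    """Hypothetical record stored alongside an index."""
    model_name: str
    dimension: int

def assert_same_space(index_space, query_space):
    """Raise if query vectors come from a different model than the index."""
    if index_space != query_space:
        raise ValueError(
            f"Index built with {index_space}, query uses {query_space}; "
            "re-embed the corpus before searching."
        )

index_space = EmbeddingSpace("BAAI/bge-small-en-v1.5", 384)

# Same model and dimension: passes silently.
assert_same_space(index_space, EmbeddingSpace("BAAI/bge-small-en-v1.5", 384))

try:
    # Same dimension, different model: still not comparable.
    assert_same_space(index_space, EmbeddingSpace("all-MiniLM-L6-v2", 384))
except ValueError as err:
    print(err)
```

Checking the dimension alone would miss the second case, since both models output 384-dimensional vectors. That is why the model name is part of the record.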
To switch provider after construction:
# Switch to a sentence-transformers model
generator.set_text_model("sentence_transformers", "all-MiniLM-L6-v2")

# Switch to BGE large
generator.set_text_model("sentence_transformers", "BAAI/bge-large-en-v1.5")

Quick Start

1. Install and initialize a provider

from semantica.embeddings import EmbeddingGenerator

# Default: FastEmbed, free, runs locally with no GPU
generator = EmbeddingGenerator()

# Use sentence-transformers instead
generator = EmbeddingGenerator(config={"text": {"method": "sentence_transformers", "model_name": "all-MiniLM-L6-v2"}})

2. Generate embeddings

# Single text → 1D array
embedding = generator.generate_embeddings("Text about AI")

# Batch → 2D array (n_texts, dim)
embeddings = generator.generate_embeddings(["Text about AI", "Machine learning concepts"])

3. Compute similarity

# Cosine similarity: near 0.0 (unrelated) up to 1.0 (identical meaning); the mathematical range is -1.0 to 1.0
score = generator.compare_embeddings(embeddings[0], embeddings[1], method="cosine")
print(f"Similarity: {score:.3f}")

4. Prepare for a vector database

from semantica.embeddings import VectorEmbeddingManager
import numpy as np

manager = VectorEmbeddingManager()

embeddings = np.random.rand(2, 384).astype(np.float32)   # placeholder vectors; substitute real embeddings
metadata   = [{"text": "doc 1"}, {"text": "doc 2"}]

result = manager.prepare_for_vector_db(embeddings, metadata=metadata, backend="faiss")
# result["vectors"]  → normalized float32 array
# result["ids"]      → ["vec_0", "vec_1", ...]
# result["metadata"] → formatted metadata list

Supported Models

| Provider | Model | Dimension | Speed | Best For |
| --- | --- | --- | --- | --- |
| fastembed | BAAI/bge-small-en-v1.5 | 384 | Very fast | Default: CPU-optimised, no GPU required |
| sentence_transformers | all-MiniLM-L6-v2 | 384 | Fast | Good balance of speed and quality |
| sentence_transformers | all-mpnet-base-v2 | 768 | Medium | Higher retrieval quality |
| sentence_transformers | BAAI/bge-large-en-v1.5 | 1024 | Medium | State-of-the-art retrieval accuracy |
| openai | text-embedding-3-small | 1536 | API | Cost-effective OpenAI embedding |
| openai | text-embedding-3-large | 3072 | API | Highest quality via OpenAI API |
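Dimension also drives storage cost. As a rough sketch, the raw size of a float32 index is the number of vectors times the dimension times 4 bytes. The helper below is hypothetical and ignores index overhead, metadata, and any compression a vector store may apply.

```python
BYTES_PER_FLOAT32 = 4

# Dimensions taken from the table above.
MODEL_DIMENSIONS = {
    "BAAI/bge-small-en-v1.5": 384,
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "BAAI/bge-large-en-v1.5": 1024,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def raw_index_size_mib(n_vectors, dimension):
    """Size of n_vectors float32 vectors in MiB, excluding overhead and metadata."""
    return n_vectors * dimension * BYTES_PER_FLOAT32 / (1024 ** 2)

for model, dim in MODEL_DIMENSIONS.items():
    size = raw_index_size_mib(1_000_000, dim)
    print(f"{model:26s} {size:9.1f} MiB per million vectors")
```

A million 384-dimensional vectors need about 1.4 GiB of raw storage. The 3072-dimensional model needs eight times that, which is worth weighing against any quality gain.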

EmbeddingGenerator

from semantica.embeddings import EmbeddingGenerator

# Default: FastEmbed with BAAI/bge-small-en-v1.5
generator = EmbeddingGenerator()
texts = ["Text about AI", "Machine learning concepts"]
embeddings = generator.generate_embeddings(texts)
similarity = generator.compare_embeddings(embeddings[0], embeddings[1])
Best for: CPU-only production, lowest latency without GPU. Default: works out of the box.

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| config | dict | None | Config dict; config["text"] is passed to TextEmbedder |
| **kwargs | | | Additional key/value config merged into config |
Use generator.set_text_model(method, model_name) to switch the embedding model after construction.

TextEmbedder

Direct text embedding with batch processing:
from semantica.embeddings import TextEmbedder

# Default: FastEmbed with BAAI/bge-small-en-v1.5
embedder = TextEmbedder()

# Single text → 1D array
embedding = embedder.embed_text("A knowledge graph connects entities with typed relationships.")

# Batch → 2D array (n_texts, dim)
embeddings = embedder.embed_batch(["First text", "Second text", "Third text"])

# Per-sentence embeddings
sentence_embeddings = embedder.embed_sentences("First sentence. Second sentence.")

# Get embedding dimension
dim = embedder.get_embedding_dimension()

TextEmbedder Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | "BAAI/bge-small-en-v1.5" | Model name to load |
| method | str | "fastembed" | Embedding method: "fastembed" or "sentence_transformers" |
| device | str | "cpu" | Device for sentence-transformers: "cpu", "cuda", "mps". Ignored for FastEmbed. |
| normalize | bool | True | L2-normalize output vectors |
Key behaviours:
  • If FastEmbed or sentence-transformers is unavailable, falls back to a 128-dimensional hash-based embedding. Hash embeddings are deterministic but not semantic: do not use in production.
  • Large batches are chunked internally by the underlying library to avoid OOM.
Dimension mismatch. The dimension you pass to your vector store must exactly match your embedding model’s output. BAAI/bge-small-en-v1.5 → 384, all-MiniLM-L6-v2 → 384, all-mpnet-base-v2 → 768, BAAI/bge-large-en-v1.5 → 1024. Check with embedder.get_embedding_dimension() before creating the store.
Fallback embeddings are not semantic. If neither FastEmbed nor sentence-transformers loads successfully, TextEmbedder silently falls back to 128-dimensional SHA-256 hash embeddings. These are deterministic but carry no semantic meaning. Check embedder.get_method(): if it returns "fallback", install your intended provider.
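To see why hash-based vectors carry no meaning, consider the sketch below. It is an assumption about what such a fallback could look like, not the library's actual code; `hash_embedding` is a hypothetical function that expands SHA-256 digests into a 128-dimensional unit vector.

```python
import hashlib
import math

def hash_embedding(text, dim=128):
    """Deterministic pseudo-embedding derived from SHA-256 digests (illustrative only)."""
    values = []
    counter = 0
    while len(values) < dim:
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        # Map each byte from 0..255 to the range -1.0..1.0.
        values.extend(byte / 127.5 - 1.0 for byte in digest)
        counter += 1
    values = values[:dim]
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

def cosine(a, b):
    # Inputs are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

same = cosine(hash_embedding("car"), hash_embedding("car"))
related = cosine(hash_embedding("car"), hash_embedding("automobile"))
print(f"identical text: {same:.3f}, related text: {related:.3f}")
```

Identical strings always map to the same vector, but synonyms such as "car" and "automobile" land in unrelated directions because a cryptographic hash discards all linguistic structure. Search results built on such vectors are effectively random.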

Provider Stores

Use provider stores directly when you need fine-grained control over a single backend:
from semantica.embeddings import (
    OpenAIStore, BGEStore, FastEmbedStore,
    ProviderStoreFactory,
)
import os

# OpenAI
store     = OpenAIStore(api_key=os.getenv("OPENAI_API_KEY"), model="text-embedding-3-small")
embedding = store.embed("Hello world")

# BGE (Sentence-Transformers wrapper): pass model_name= not model=
store     = BGEStore(model_name="BAAI/bge-large-en-v1.5")
embedding = store.embed("Hello world")

# FastEmbed: ONNX runtime, no CUDA required
store     = FastEmbedStore(model_name="BAAI/bge-small-en-v1.5")
embedding = store.embed("Hello world")
# FastEmbedStore also has an efficient batch method
embeddings = store.embed_batch(["text1", "text2", "text3"])

# Auto-select from a name string: useful in config-driven pipelines
# Supported providers: "openai", "bge", "fastembed"
store = ProviderStoreFactory.create(provider="bge", model_name="BAAI/bge-large-en-v1.5")
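The value of a name-based factory in a config-driven pipeline can be sketched with a small registry. Everything in this example is a stand-in: `DemoStore`, `REGISTRY`, and `create_store` are hypothetical and do not reflect how ProviderStoreFactory is implemented.

```python
class DemoStore:
    """Stand-in for a real provider store; it only records its configuration."""

    def __init__(self, provider, model_name):
        self.provider = provider
        self.model_name = model_name

def _builder(provider):
    # Returns a constructor bound to one provider name.
    return lambda model_name: DemoStore(provider, model_name)

# Hypothetical registry mirroring the three supported provider names.
REGISTRY = {name: _builder(name) for name in ("openai", "bge", "fastembed")}

def create_store(provider, model_name):
    """Look up the provider by name and fail loudly on unknown names."""
    if provider not in REGISTRY:
        raise ValueError(
            f"Unknown provider {provider!r}; expected one of {sorted(REGISTRY)}"
        )
    return REGISTRY[provider](model_name)

# Settings as they might arrive from a YAML or JSON config file.
config = {"provider": "bge", "model_name": "BAAI/bge-large-en-v1.5"}
store = create_store(**config)
print(store.provider, store.model_name)
```

Keeping the provider as a plain string means that switching backends is a configuration change, not a code change. An unsupported name is rejected at startup instead of at the first embed call.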
LlamaStore is not functional. It exists in the module only as a placeholder: it does not connect to Ollama and always raises ProcessingError at embed time. Do not use it in production. Use FastEmbedStore for local ONNX-based embeddings or BGEStore for sentence-transformers-based local embeddings instead.

Pooling Strategies

Pooling aggregates a set of embeddings into a single vector, which is useful when you have multiple chunk embeddings to combine:
from semantica.embeddings import MeanPooling

pooler = MeanPooling()
# token_embeddings: the vectors to combine, one row per token or chunk
pooled = pooler.pool(token_embeddings)   # shape: (hidden_dim,)
Best for: retrieval, semantic search, and clustering: averages all contributions.
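The simpler strategies reduce to one-line column operations. The sketch below shows mean, max, and CLS pooling in plain Python on a toy matrix. The function names are hypothetical, and the attention and hierarchical strategies are omitted because they need learned weights or a grouping scheme.

```python
def mean_pool(vectors):
    """Average each dimension across all input vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def max_pool(vectors):
    """Keep the largest value seen in each dimension."""
    return [max(column) for column in zip(*vectors)]

def cls_pool(vectors):
    """Use the first vector (the [CLS] position in BERT-style models)."""
    return list(vectors[0])

# Three toy 3-dimensional vectors, one row per token or chunk.
chunk_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 1.0],
]

print(mean_pool(chunk_embeddings))  # [2.0, 2.0, 1.0]
print(max_pool(chunk_embeddings))   # [3.0, 4.0, 2.0]
print(cls_pool(chunk_embeddings))   # [1.0, 0.0, 2.0]
```

Mean pooling lets every row contribute equally, max pooling keeps only the strongest signal per dimension, and CLS pooling trusts the model to have summarized the sequence in its first position.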

GraphEmbeddingManager

Embed graph nodes and edges for storage in graph databases:
from semantica.embeddings import GraphEmbeddingManager

manager = GraphEmbeddingManager()

entities = [
    {"id": "e1", "text": "Apple Inc.", "type": "Organization"},
    {"id": "e2", "text": "Tim Cook",   "type": "Person"},
]
relationships = [
    {"source": "e2", "target": "e1", "type": "CEO_OF"}
]

# Embed entities → dict of {id: np.ndarray}
node_embeddings = manager.embed_entities(entities)

# Embed relationships → dict of {id: np.ndarray}
edge_embeddings = manager.embed_relationships(relationships)

# Or prepare everything at once for a graph DB backend
result = manager.prepare_for_graph_db(entities, relationships, backend="neo4j")
# result["node_embeddings"] → {id: np.ndarray}
# result["edge_embeddings"] → {id: np.ndarray}
# result["nodes"]           → entities with "embedding" field added
# result["edges"]           → relationships with "embedding" field added
Supported backends: "neo4j", "networkx", "falkordb"

VectorEmbeddingManager

Prepare and validate embeddings for vector database storage:
from semantica.embeddings import VectorEmbeddingManager
import numpy as np

manager    = VectorEmbeddingManager()
embeddings = np.random.rand(5, 384).astype(np.float32)
metadata   = [{"text": f"doc_{i}", "category": "science"} for i in range(5)]

# Prepare for FAISS
result = manager.prepare_for_vector_db(embeddings, metadata=metadata, backend="faiss")
# result["vectors"]  → L2-normalized float32 array
# result["ids"]      → ["vec_0", "vec_1", ...]
# result["metadata"] → formatted metadata list

# Validate dimensions before insertion
is_valid = manager.validate_dimensions(embeddings, backend="milvus")

# Prepare multiple batches at once
embeddings_a, embeddings_b = embeddings[:2], embeddings[2:]
combined = manager.batch_prepare([embeddings_a, embeddings_b], backend="qdrant")
Supported backends: "faiss", "weaviate", "qdrant", "milvus"
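The preparation step can be pictured as three small operations: validate dimensions, L2-normalize, and assign ids. The sketch below is a simplified stand-in in plain Python, not the VectorEmbeddingManager implementation; `prepare` and `l2_normalize` are hypothetical helpers, and the `vec_N` id scheme simply follows the result shown above.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0.0 else [x / norm for x in vector]

def prepare(vectors, metadata=None, expected_dim=None):
    """Hypothetical stand-in showing the shape of a prepared payload."""
    if expected_dim is not None:
        bad = [i for i, v in enumerate(vectors) if len(v) != expected_dim]
        if bad:
            raise ValueError(f"Vectors at positions {bad} are not {expected_dim}-dimensional")
    if metadata is not None and len(metadata) != len(vectors):
        raise ValueError("metadata must contain one entry per vector")
    return {
        "vectors": [l2_normalize(v) for v in vectors],
        "ids": [f"vec_{i}" for i in range(len(vectors))],
        "metadata": metadata if metadata is not None else [{} for _ in vectors],
    }

payload = prepare(
    [[3.0, 4.0], [0.0, 2.0]],
    metadata=[{"text": "doc 1"}, {"text": "doc 2"}],
    expected_dim=2,
)
print(payload["ids"])         # ['vec_0', 'vec_1']
print(payload["vectors"][0])  # [0.6, 0.8]
```

Validating before insertion surfaces a dimension mismatch as a clear error in your own code, not as a backend-specific failure later.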

Common Workflows

from semantica.embeddings import TextEmbedder

embedder = TextEmbedder()   # default: FastEmbed

texts = [
    "Apple Inc. was founded by Steve Jobs.",
    "Microsoft was co-founded by Bill Gates.",
    "Amazon was started by Jeff Bezos.",
]

# All at once: more efficient than calling embed_text() per item
embeddings = embedder.embed_batch(texts)
print(f"Shape: {embeddings.shape}")   # (3, 384)

Similarity Computation

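from semantica.embeddings import calculate_similarity, embed_text

# Two example embeddings to compare
embedding_a = embed_text("Text about AI", method="sentence_transformers")
embedding_b = embed_text("Machine learning concepts", method="sentence_transformers")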

# Cosine similarity: direction only, not magnitude; most common for text
score = calculate_similarity(embedding_a, embedding_b, method="cosine")
# → 1.0 (identical direction), 0.0 (orthogonal / unrelated); the mathematical range extends to -1.0 (opposite direction)

# Euclidean distance converted to similarity
score = calculate_similarity(embedding_a, embedding_b, method="euclidean")
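How a distance becomes a similarity is a design choice. The sketch below uses 1 / (1 + d), one common mapping; whether calculate_similarity uses this formula is an assumption, and both helper names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_to_similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)

a = [1.0, 0.0]
b = [1.0, 0.0]   # identical to a
c = [0.0, 1.0]   # orthogonal to a

print(distance_to_similarity(euclidean_distance(a, b)))            # 1.0
print(round(distance_to_similarity(euclidean_distance(a, c)), 3))  # 0.414
```

For L2-normalized vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, so both methods rank neighbours in the same order. The choice matters mainly when vectors are not normalized.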

Convenience Functions

from semantica.embeddings import (
    embed_text, generate_embeddings, calculate_similarity,
    pool_embeddings, check_available_providers,
)

# Single text: fastest path
emb = embed_text("Hello world", method="sentence_transformers")

# Batch
embs = generate_embeddings(["text1", "text2"], method="default")

# Pool multiple embeddings into one
pooled = pool_embeddings(embs, method="mean")

# Check which providers are installed
providers = check_available_providers()
# → {"sentence_transformers": True, "fastembed": True, "openai": False}

Vector Store

Store and search the generated embeddings.

Split

Chunk text before embedding for better retrieval quality.

KG Module

Distance Intelligence uses graph embeddings for semantic neighbourhoods.

Deduplication

Semantic deduplication uses embedding distance for entity resolution.