PgVectorStore adds PostgreSQL-native vector storage and similarity search to Semantica: no dedicated vector database required.

Overview

PgVectorStore provides native PostgreSQL vector storage using the pgvector extension. It supports multiple distance metrics (cosine similarity, L2/Euclidean, inner product), index types (IVFFlat, HNSW), and JSONB metadata storage with filtering.

Features

Distance metrics: cosine, L2 (Euclidean), inner product
Index types: IVFFlat and HNSW for approximate nearest-neighbor search
JSONB metadata storage with filtering support
Connection pooling via psycopg3/psycopg2
Batch insert, update, and delete
Idempotent index creation: safe to call multiple times

Setup

Prerequisites

  1. PostgreSQL 13+ with pgvector extension installed
  2. Python dependencies: psycopg3 (preferred) or psycopg2-binary, pgvector

Installing Dependencies

# Install with pgvector support
pip install semantica[vectorstore-pgvector]

# Or install manually
pip install psycopg[binary] pgvector
# Or for psycopg2
pip install psycopg2-binary pgvector

PostgreSQL Setup

1

Install the pgvector extension

-- Debian/Ubuntu
sudo apt-get install postgresql-16-pgvector

-- macOS (Homebrew)
brew install pgvector

-- Or build from source: https://github.com/pgvector/pgvector
2

Create the extension in your database

CREATE EXTENSION vector;
3

Verify installation

SELECT * FROM pg_extension WHERE extname = 'vector';

Docker Quickstart

docker run -d \
    --name pgvector \
    -e POSTGRES_PASSWORD=postgres \
    -p 5432:5432 \
    ankane/pgvector:latest

Connection String Format

Standard PostgreSQL connection string:
postgresql://user:password@host:port/database
Examples:
# Local development
"postgresql://postgres:postgres@localhost:5432/semantica"

# With SSL
"postgresql://user:pass@host/db?sslmode=require"

# Connection parameters
"postgresql://user:pass@localhost/db?connect_timeout=10&application_name=semantica"

Usage

Basic Usage

from semantica.vector_store import PgVectorStore
import numpy as np

# Initialize store
store = PgVectorStore(
    connection_string="postgresql://postgres:postgres@localhost:5432/semantica",
    table_name="document_vectors",
    dimension=768,
    distance_metric="cosine",
    pool_size=10
)

# Add vectors
vectors = [np.random.rand(768).astype(np.float32) for _ in range(100)]
metadata = [{"doc_id": i, "category": "article"} for i in range(100)]
ids = store.add(vectors, metadata)

# Search
query = np.random.rand(768).astype(np.float32)
results = store.search(query, top_k=10)

# Results format: [{"id": "...", "score": 0.95, "metadata": {...}}, ...]
for result in results:
    print(f"ID: {result['id']}, Score: {result['score']:.4f}")

# Close store
store.close()

Context Manager

with PgVectorStore(
    connection_string="postgresql://...",
    table_name="vectors",
    dimension=768,
    distance_metric="cosine"
) as store:
    vectors = [np.random.rand(768).astype(np.float32)]
    ids = store.add(vectors, [{"source": "test"}])
    # Store automatically closed on exit

Metadata Filtering

# Search with metadata filter
results = store.search(
    query_vector,
    top_k=10,
    filter={"category": "science", "published": True}
)

Update and Delete

# Update vectors and metadata
new_vectors = [np.random.rand(768).astype(np.float32)]
new_metadata = [{"updated": True}]
store.update(["vec_0"], new_vectors, new_metadata)

# Update metadata only
store.update(["vec_0"], metadata=[{"tag": "updated"}])

# Delete vectors
store.delete(["vec_0", "vec_1"])

Retrieve by ID

results = store.get(["vec_0", "vec_1"])
# Returns: [{"id": "vec_0", "vector": np.array(...), "metadata": {...}}, ...]

Index Creation

# Create HNSW index for approximate nearest neighbor search
store.create_index(
    index_type="hnsw",
    params={"m": 16, "ef_construction": 64}
)

# Create IVFFlat index
store.create_index(
    index_type="ivfflat",
    params={"lists": 100}
)
Index creation is idempotent: calling multiple times is safe.

Statistics

stats = store.get_stats()
# Returns: {
#     "table_name": "document_vectors",
#     "dimension": 768,
#     "distance_metric": "cosine",
#     "vector_count": 1000,
#     "indexes": [...],
#     "psycopg_version": "3"
# }

Distance Metrics

MetricOperatorDescriptionUse Case
cosine<=>Cosine distance (1 - cosine similarity)Semantic similarity, text embeddings
l2<->Euclidean distanceGeometric distance, clustering
inner_product<#>Negative inner productMaximum inner product search
Note: Scores returned by search() are normalized to similarity (higher = better) regardless of metric.

Index Types

Hierarchical Navigable Small World: best for high-dimensional vectors with high recall requirements.
ProsFast search, good recall, incremental build
ConsHigher memory usage, slower build time
store.create_index(index_type="hnsw", params={
    "m": 16,               # connections per layer (default: 16)
    "ef_construction": 64  # build-time accuracy/speed tradeoff (default: 64)
})

Schema

The vector table schema:
CREATE TABLE IF NOT EXISTS {table_name} (
    id TEXT PRIMARY KEY,
    vector VECTOR({dimension}),
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT NOW()
);

Migration Notes

From Other Vector Stores

# Export from existing store
from semantica.vector_store import FAISSStore

faiss_store = FAISSStore(dimension=768)
# ... load existing index

# Migrate to PgVectorStore
pg_store = PgVectorStore(
    connection_string="postgresql://...",
    table_name="migrated_vectors",
    dimension=768,
    distance_metric="cosine"
)

# Get all vectors from source
all_ids = list(faiss_store.index.vector_ids)
all_vectors = [...]  # Get vectors from source
all_metadata = [faiss_store.index.metadata.get(id, {}) for id in all_ids]

# Batch insert
pg_store.add(all_vectors, all_metadata, all_ids)

Backup and Restore

Use PostgreSQL native backup tools:
# Backup
pg_dump -h localhost -U postgres -d semantica -t document_vectors > vectors_backup.sql

# Restore
psql -h localhost -U postgres -d semantica < vectors_backup.sql

Configuration

Connection Pool Settings

store = PgVectorStore(
    connection_string="postgresql://...",
    table_name="vectors",
    dimension=768,
    distance_metric="cosine",
    pool_size=10,        # Max connections in pool
    max_overflow=10      # Extra connections beyond pool_size
)

Environment Variables

# Connection string via environment
export SEMANTICA_PGVECTOR_URL="postgresql://user:pass@host/db"
import os

store = PgVectorStore(
    connection_string=os.getenv("SEMANTICA_PGVECTOR_URL"),
    table_name="vectors",
    dimension=768,
    distance_metric="cosine"
)

Error Handling

Common errors and solutions:
ErrorCauseSolution
ProcessingError: pgvector extension is not installedpgvector not in PostgreSQLRun CREATE EXTENSION vector;
ValidationError: Unsupported distance metricInvalid metricUse: cosine, l2, inner_product
ValidationError: dimension mismatchVector dim != store dimEnsure consistent dimensions
ProcessingError: Failed to initialize connection poolConnection issueCheck connection string, network

Performance Tuning

  1. Use indexes for large datasets (>10k vectors)
  2. Tune HNSW parameters: Higher m and ef_construction = better recall, slower build
  3. Connection pool size: Set based on concurrent workload
  4. Batch operations: Use add() with lists instead of individual inserts

Testing

Tests require a running PostgreSQL with pgvector:
# Start PostgreSQL with Docker
docker run -d \
    --name pgvector-test \
    -e POSTGRES_PASSWORD=postgres \
    -p 5432:5432 \
    ankane/pgvector:latest

# Run tests
pytest tests/vector_store/test_pgvector_store.py -v

# Or with specific connection string
TEST_PGVECTOR_URL="postgresql://postgres:postgres@localhost:5432/test" \
    pytest tests/vector_store/test_pgvector_store.py -v

See Also