October 7, 2025

Turbopuffer: Serverless Vector Database

Modern applications need to find similar content, whether that's matching user queries to relevant documents, recommending products, or powering AI features like semantic search. The technology behind this (vector databases) forces developers into an uncomfortable choice: pay premium prices for managed solutions, or manage complex self-hosted systems. A serverless vector database that eliminates infrastructure overhead while delivering competitive performance at a fraction of the cost changes this equation entirely.

What is Turbopuffer?

Why This Matters

Before diving into Turbopuffer's specific approach, it's important to understand why developers should care about vector databases. When you need to find "similar" content, like matching a user's question to relevant documentation, powering recommendation engines, or enabling AI chatbots to reference relevant context, traditional keyword search falls short. Vector databases convert content into mathematical representations (vectors) that capture semantic meaning, enabling applications to find truly relevant results rather than just keyword matches. This capability has become essential for everything from AI-powered search to recommendation systems to Retrieval-Augmented Generation (RAG) applications.
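
To make "semantic meaning" concrete, here is a minimal sketch of how vector similarity surfaces related content where keyword matching fails. The toy vectors stand in for real embeddings from a model; nothing here is Turbopuffer-specific.

# Minimal illustration of vector similarity. The toy 3-dimensional
# vectors stand in for real embeddings (which typically have
# hundreds of dimensions) produced by an embedding model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction (semantically close), ~0.0 = unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.9, 0.1, 0.3])    # "How do I reset my password?"
doc_a = np.array([0.85, 0.15, 0.35]) # "Resetting your account credentials"
doc_b = np.array([0.1, 0.9, 0.2])    # "Quarterly revenue report"

print(cosine_similarity(query, doc_a))  # high: related despite no shared keywords
print(cosine_similarity(query, doc_b))  # low: unrelated content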

What Problem Does This Solve for Developers?

Traditional vector databases force you to choose between convenience and cost. Managed services offer simplicity but at premium pricing, while self-hosted solutions require significant operational overhead. Turbopuffer resolves this tradeoff, with reported cost reductions of up to 10x compared to alternatives.

Turbopuffer is a serverless vector database that combines the operational simplicity of managed services with the cost efficiency of self-hosted solutions. Its hybrid architecture separates storage from compute, using object storage for durable persistence and NVMe SSD caching for query performance, so each layer can scale independently.

Market Positioning

In the vector database landscape, Turbopuffer occupies a unique position as a serverless-first solution. Unlike traditional SaaS models or self-hosted approaches, Turbopuffer's architecture enables automatic scaling without capacity planning while maintaining cost predictability through usage-based billing.

Why It Exists & How It Works

Turbopuffer emerged from the recognition that existing vector database architectures weren't optimized for the elastic, cost-sensitive nature of modern AI applications. Traditional approaches tie storage and compute together, forcing you to pay for idle capacity or risk performance degradation during traffic spikes.

High-Level Architecture

The system employs a hybrid serverless design that fundamentally separates state management from compute operations:

  • Object Storage Layer: Serves as the authoritative data source, with a write-ahead log (WAL) ensuring durability and consistency during writes
  • Compute Layer: Rust-based query processing that accesses data directly from storage
  • Caching Layer: NVMe SSD and memory caching for performance optimization

This separation enables the system to scale compute resources independently while maintaining data durability through persistent object storage. The result is a database that can handle massive scale without requiring infrastructure management.
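
To make the separation concrete, here is a conceptual sketch of a tiered read path in such a design. This is illustrative only: Turbopuffer's actual engine is written in Rust and its internals are not exposed through the public API.

# Conceptual tiered read path for a storage/compute-separated design.
# Illustrative only; not Turbopuffer's actual implementation.
def fetch_from_object_storage(block_id: str) -> bytes:
    # Stand-in for an S3/GCS GET: slow but durable source of truth
    return b"block-data-for-" + block_id.encode()

def read_block(block_id: str, memory_cache: dict, ssd_cache: dict) -> bytes:
    if block_id in memory_cache:       # hot: already in memory
        return memory_cache[block_id]
    if block_id in ssd_cache:          # warm: on local NVMe, single-digit ms
        memory_cache[block_id] = ssd_cache[block_id]
        return memory_cache[block_id]
    # cold: fall through to object storage (hundreds of ms),
    # then populate both cache tiers for subsequent reads
    data = fetch_from_object_storage(block_id)
    ssd_cache[block_id] = data
    memory_cache[block_id] = data
    return data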

Key Features & Capabilities

Fast Search with Built-in Filtering (SPFresh Vector Indexing)

You can build search systems that are both fast and precise while filtering results by metadata, something many vector databases struggle with. This works through SPFresh, a centroid-based indexing algorithm that Turbopuffer pairs with efficient native filtering. Unlike many vector databases that treat filtering as an afterthought, Turbopuffer's approach combines attribute and vector indexes seamlessly.

To find similar content quickly among millions of documents, Turbopuffer uses centroid-based approximate nearest neighbor (ANN) indexing. Its hybrid index architecture combines attribute and vector indexes, so complex filtered queries run without performance degradation.

Implementation Details:

  • Accuracy: Delivers 90-100% recall rates for vector similarity searches
  • Filtering Performance: Up to 4x faster filtering compared to alternatives in specific use cases
  • Hybrid Queries: Combines vector search with full-text search (BM25) in single operations

Use Case Example:

# Find similar documents with specific metadata constraints
results = ns.query(
    vector=[0.1, 0.2, 0.3],
    filters={'category': 'technical', 'date': {'$gte': '2024-01-01'}},
    top_k=10,
    include_attributes=True
)

REST API with Comprehensive Operations

You can orchestrate vector operations through a REST-based interface with JSON encoding, supporting both basic vector operations and advanced query patterns from any HTTP-capable language, including TypeScript AI agent frameworks.

Core Operations:

  • Vector Operations: Upsert, patch, delete, conditional writes
  • Query Types: Vector search, full-text search, multi-queries, aggregations
  • Data Management: Delete by filter, copy between namespaces, bulk operations
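
As a sketch of how a few of these write operations can look in the Python SDK: the upsert call mirrors the examples elsewhere in this article, but the delete-by-filter method name below is an assumption, so check the SDK reference for the exact signature.

# Upsert documents, then bulk-delete by filter. The upsert shape
# follows the other examples in this article; delete_by_filter is
# a hypothetical method name for the documented delete-by-filter
# operation (verify against the SDK docs).
import turbopuffer as tpuf

client = tpuf.Turbopuffer()  # reads TURBOPUFFER_API_KEY
ns = client.namespace("documents")

ns.upsert(
    ids=[101, 102],
    vectors=[[0.1, 0.2], [0.3, 0.4]],
    attributes={"status": ["draft", "published"]},
)

# Bulk cleanup: remove everything still in draft
ns.delete_by_filter({"status": "draft"})  # hypothetical method name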

Advanced Query Capabilities:

# Multi-query with different search types
results = ns.query([
    {'vector': [0.1, 0.2], 'top_k': 5},
    {'query': 'machine learning', 'top_k': 10},
    {'filters': {'type': 'research'}, 'top_k': 3}
])

Official SDK Support Across Six Languages

You can build AI agents with Turbopuffer using official SDKs for Python 3.8+, TypeScript/JavaScript, Go, Java, and Ruby, plus a community-maintained Rust client, all with consistent APIs and strong typing support, a natural fit for TypeScript AI agent frameworks.

Python SDK Features:

  • Performance: Optional C binaries with pip install turbopuffer[fast]
  • Async Support: Full async/await compatibility with AsyncTurbopuffer
  • Type Safety: TypedDicts for requests, Pydantic models for responses
import turbopuffer as tpuf

# Initialize with automatic API key detection
client = tpuf.Turbopuffer()  # Uses TURBOPUFFER_API_KEY
ns = client.namespace("documents")

# Batch upsert with attributes
ns.upsert(
    ids=[1, 2, 3],
    vectors=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    attributes={
        "title": ["Doc 1", "Doc 2", "Doc 3"],
        "category": ["tech", "business", "tech"]
    }
)
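
Since the SDK also ships an async client, an equivalent async sketch might look like the following, assuming AsyncTurbopuffer mirrors the synchronous client's method names (as the feature list above suggests).

# Async variant (a sketch; assumes AsyncTurbopuffer mirrors the
# synchronous client's interface, per the feature list above).
import asyncio
import turbopuffer as tpuf

async def main() -> None:
    client = tpuf.AsyncTurbopuffer()  # Uses TURBOPUFFER_API_KEY
    ns = client.namespace("documents")
    results = await ns.query(vector=[0.1, 0.2], top_k=5)
    for result in results:
        print(result)

asyncio.run(main())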

Serverless Architecture with Intelligent Caching

You can deploy vector search systems that scale automatically. The platform's intelligent NVMe SSD and memory caching dramatically improves performance: for 1M vectors, cold queries land around p90=444ms, while warm queries hit p50=8ms.

Cache Management:

  • Automatic Warming: Frequently accessed data stays cached
  • Memory Efficiency: 98.5% disk budget allocation by default
  • Performance Monitoring: Built-in metrics for cache hit rates
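
There is no documented one-call "warm the cache" endpoint here, but a common pattern (sketched below under that assumption) is to issue representative queries at startup so critical namespaces are pulled onto NVMe before user traffic arrives. The namespace names and vector dimensionality are placeholders.

# Cache warming by issuing representative queries at startup
# (a common operational pattern, not an official Turbopuffer feature).
import turbopuffer as tpuf

client = tpuf.Turbopuffer()

CRITICAL_NAMESPACES = ["user_docs", "product_search_main"]  # example names
PROBE_VECTOR = [0.0] * 768  # match your embedding dimensionality

def warm_caches() -> None:
    for name in CRITICAL_NAMESPACES:
        ns = client.namespace(name)
        # The result is discarded; the point is to pull the
        # namespace's index blocks from object storage onto NVMe.
        ns.query(vector=PROBE_VECTOR, top_k=1)

warm_caches()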

Getting Started

You can deploy Turbopuffer with minimal configuration thanks to its serverless design and official SDK support across multiple programming languages, making it perfect for integrating with TypeScript AI agent frameworks.

Quick Setup

1. Install the SDK

# Python with performance optimizations
pip install turbopuffer[fast]

# TypeScript/JavaScript
npm install @turbopuffer/turbopuffer

# Go
go get github.com/turbopuffer/turbopuffer-go

2. Authentication

export TURBOPUFFER_API_KEY="your_api_key_here"

3. Basic Implementation

import turbopuffer as tpuf

# Initialize client (auto-detects API key)
client = tpuf.Turbopuffer()
ns = client.namespace("quickstart")

# Add documents with vectors and metadata
ns.upsert(
    ids=[1, 2],
    vectors=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    attributes={
        "text": ["First document", "Second document"],
        "category": ["demo", "example"]
    }
)

# Search for similar content
results = ns.query(
    vector=[0.1, 0.2, 0.3],
    top_k=5,
    include_attributes=True
)

for result in results:
    print(f"Score: {result.similarity}, Text: {result.attributes['text']}")

Development Features

You can leverage several developer-friendly features that streamline the development process:

  • Strong Typing: Full IDE autocomplete and type checking
  • Auto-pagination: Built-in iterators for large result sets
  • Error Handling: Comprehensive error responses with JSON formatting
  • HTTP Optimization: Automatic gzip compression and retries
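
A sketch of how error handling and iteration might come together is shown below. The exception class name is an assumption; check the SDK for the exact error hierarchy.

# Error handling sketch (tpuf.APIError is an assumed exception
# name; consult the SDK's documented error types). Retries and
# gzip compression happen automatically per the list above.
import turbopuffer as tpuf

client = tpuf.Turbopuffer()
ns = client.namespace("documents")

try:
    for row in ns.query(vector=[0.1, 0.2, 0.3], top_k=100):
        print(row.id)
except tpuf.APIError as err:  # hypothetical exception class
    # Errors arrive as structured JSON responses per the list above
    print(f"Request failed: {err}")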

Advanced Usage & Best Practices

Production deployments require careful attention to performance tuning parameters and architectural decisions that can significantly impact both performance and costs when building agentic workflows.

Performance Optimization

Query Concurrency Tuning:

# Configure for high-throughput scenarios
client = tpuf.Turbopuffer(
    max_concurrent_queries=16,  # Default per namespace
    max_wait_time_ms=800        # Maximum wait for query slot
)

You can optimize cache management and indexing for better agent orchestration performance:

  • Default Configuration: 98.5% of NVMe SSD capacity allocated to cache
  • Indexing Thresholds: Minimum 5,000 unindexed documents trigger reindexing
  • Batch Optimization: Use 256 MB batches for up to 50% cost discounts

These optimizations enable sophisticated memory retrieval patterns for AI agents that need to access contextual information efficiently.

Production Architecture Patterns

Namespace Organization Strategy: Namespace design is the most critical architectural decision when building TypeScript AI agent frameworks, because Turbopuffer's per-namespace limits require careful planning:

# Recommended: Multiple smaller namespaces
user_docs = client.namespace(f"user_{user_id}_docs")
product_search = client.namespace("product_search_main")

# Avoid: Single large namespace approaching limits
# global_docs = client.namespace("all_documents")  # Will hit 250M doc limit

Efficient Filtering Patterns:

# Optimized: Specific filter patterns
results = ns.query(
    vector=query_vector,
    filters={'status': 'published', 'category': {'$in': ['tech', 'science']}},
    top_k=20
)

# Avoid: Expensive glob patterns in middle
# filters={'title': {'$glob': '*report*'}}  # Triggers full scan

Cost Optimization Strategies

Batching for Discounts:

# Efficient: Batch up to 256 MB for cost savings
batch_ids = []
batch_vectors = []
for document in large_dataset:
    batch_ids.append(document.id)
    batch_vectors.append(document.vector)
    if len(batch_vectors) >= 1000:  # Or a size-based threshold
        ns.upsert(ids=batch_ids, vectors=batch_vectors)
        batch_ids.clear()
        batch_vectors.clear()

# Flush any remaining partial batch
if batch_vectors:
    ns.upsert(ids=batch_ids, vectors=batch_vectors)

ID Type Optimization: Choose efficient ID formats to minimize storage overhead:

  • Best: Native integers (user_id: int)
  • Good: 16-byte UUIDs in binary format
  • Avoid: 36-byte UUID strings (significant overhead)
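
A quick back-of-the-envelope sketch (plain Python, not Turbopuffer-specific) shows why the ID format matters at scale:

# Back-of-the-envelope: raw ID storage for 100M documents
import uuid

u = uuid.uuid4()
sizes = {
    "int64": 8,                   # native integer ID
    "uuid binary": len(u.bytes),  # 16 bytes
    "uuid string": len(str(u)),   # 36 bytes
}
docs = 100_000_000
for label, size in sizes.items():
    print(f"{label:12s}: {docs * size / 1e9:5.1f} GB of raw ID storage")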

Real-World Usage

Turbopuffer powers vector search for several high-profile applications; the most documented success story is Cursor's implementation for AI-powered code assistance.

Cursor: 10x Cost Reduction with Billions of Vectors

Cursor, the AI-powered code editor, migrated to Turbopuffer to manage billions of vectors across millions of codebases. The migration delivered:

  • 10x cost reduction compared to their previous solution
  • Significant latency improvements for both cold and warm queries
  • Simplified architecture eliminating manual capacity planning

Technical Implementation:

# Cursor's approach to namespace sharding for scale
def get_codebase_namespace(repo_id):
    # Note: use a stable hash (e.g. hashlib) in production;
    # Python's built-in hash() varies across processes
    shard = hash(repo_id) % 10  # Distribute across 10 namespaces
    return client.namespace(f"codebase_shard_{shard}")

# Index code embeddings with metadata
ns = get_codebase_namespace(repo_id)
ns.upsert(
    vectors=code_embeddings,
    attributes={
        'file_path': file_paths,
        'function_name': function_names,
        'language': languages
    }
)

Other Production Deployments

Readwise: Powers AI search features in their knowledge management platform, enabling semantic search across user-saved articles and highlights.

Enterprise Scale: The platform currently serves over 1 trillion documents across all deployments, processing 10 million+ writes per second and serving 10,000+ queries per second.

Limitations & Considerations

Understanding Turbopuffer's constraints is crucial for architectural planning when building AI agents, as several fundamental limits can impact scalability.

Critical Scaling Constraints

Per-Namespace Document Limits: The most significant constraint is per-namespace rather than global limits:

  • Current: ~250 million documents (512 GB) per namespace
  • Expanding to: 1 billion documents (1 TB) per namespace
  • Write Throughput: 10,000 writes/second per namespace (32 MB/s)
  • Query Throughput: ~1,000 QPS per namespace, expanding to 10,000

Document and Attribute Constraints:

# Maximum limits to consider in design
document = {
    'id': 'max_128_bytes',          # ID length limit
    'vector': [0.1] * 10752,        # Max dimensions (affects cost/latency)
    'attributes': {
        'large_text': 'max_8_MiB',  # Per-attribute value limit
        'filterable': 'max_4_KiB'   # Filterable values are much smaller
    }
}
# Maximum 256 attributes per namespace total

Performance Gotchas

Cache-Dependent Performance: Cold queries can be roughly 50x slower than warm queries (p90=444ms cold vs p50=8ms warm), making cache warming strategies critical for consistent performance in agentic workflows.

Expensive Filter Patterns:

# Avoid: Glob patterns with wildcards in the middle
{'title': {'$glob': '*report*'}}  # Triggers full namespace scan

# Prefer: Prefix or suffix patterns
{'title': {'$glob': 'report*'}}   # Uses index efficiently

Namespace Sharding Complexity: Large applications require manual sharding across namespaces, adding operational complexity:

# Required pattern for scale beyond single-namespace limits
def get_document_namespace(doc_id):
    # Use a stable hash in production; built-in hash() varies per process
    shard = hash(doc_id) % 10
    return client.namespace(f"docs_shard_{shard}")

# Must implement cross-shard querying manually
results = []
for shard in range(10):
    ns = client.namespace(f"docs_shard_{shard}")
    shard_results = ns.query(vector=query_vector, top_k=20)
    results.extend(shard_results)

Cost and Operational Considerations

Minimum Commitments: Monthly minimums don't roll over, requiring careful capacity planning despite the serverless architecture.

Filterable Attribute Overhead: Making attributes filterable costs 50% more for indexing, requiring strategic decisions about which metadata needs filtering capability.
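
In practice that means opting out of filterability for attributes you only need to store and return. The sketch below assumes the write path accepts a per-attribute schema with a filterable flag; verify the exact field names against Turbopuffer's schema documentation.

# Sketch: store large text without paying the filterable-indexing
# premium. The schema shape is an assumption; check Turbopuffer's
# schema docs for the exact field names and types.
ns.upsert(
    ids=[1],
    vectors=[[0.1, 0.2, 0.3]],
    attributes={"body": ["long article text..."], "category": ["tech"]},
    schema={
        "body": {"type": "string", "filterable": False},    # store-only
        "category": {"type": "string", "filterable": True}  # +50% indexing cost
    },
)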

Deployment Options

Turbopuffer offers two deployment models to address different organizational requirements and compliance needs.

Cloud-Hosted (Standard)

The default deployment requires minimal setup and provides immediate access to the full platform capabilities:

# Simple cloud deployment setup
export TURBOPUFFER_API_KEY="your_api_key"
pip install turbopuffer[fast]

BYOC (Bring Your Own Cloud)

For organizations requiring data sovereignty or custom security controls, BYOC deployment supports Kubernetes environments across AWS, GCP, and Azure.

BYOC Features:

  • Multi-AZ Support: Deployment across multiple availability zones
  • Custom Authentication: Self-managed API key systems
  • Data Partitioning: Control over data distribution and isolation
  • Compliance: Meets SOC 2 Type 2, HIPAA requirements

Kubernetes Configuration:

# Example Helm values for production BYOC
turbopuffer:
  cache:
    diskBudget: "95%"
  performance:
    maxConcurrentQueries: 16
    maxWaitTimeMs: 800
  indexing:
    minUnindexed: 5000
    maxUnindexed: 50000

FAQ

How does Turbopuffer's serverless architecture affect performance predictability?

Turbopuffer's caching system provides predictable performance for frequently accessed data (p50=8ms warm queries), but cold queries can experience higher latency (p90=444ms). The system automatically warms caches based on access patterns, and production deployments should implement cache warming strategies for critical data paths in AI agent workflows.

What happens when I hit the 250 million document limit per namespace?

You'll need to implement namespace sharding before reaching the limit. The recommended approach is horizontal partitioning based on logical boundaries (user IDs, content categories, time ranges). Turbopuffer is expanding limits to 1 billion documents per namespace, but architectural planning for sharding is still recommended for applications expecting massive scale.

How does Turbopuffer's cost structure compare to self-hosted solutions?

While self-hosted solutions have lower direct costs, Turbopuffer eliminates infrastructure management, monitoring, backup, and scaling responsibilities. For variable workloads, Turbopuffer's usage-based pricing often results in lower total cost of ownership. Fixed, high-volume workloads may find self-hosted solutions more cost-effective if you have the operational expertise.

Can I migrate existing vector data from other databases to Turbopuffer?

Yes, you can orchestrate data migrations using Turbopuffer's REST API and SDKs with bulk import operations. For large migrations, implement batching strategies up to 256 MB per batch to optimize costs. The migration process typically involves extracting vectors and metadata from your existing system and using Turbopuffer's upsert operations with appropriate namespace organization.
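
A minimal migration loop might look like the sketch below. The source-database reader is a hypothetical stand-in for your existing system's export API, and batch sizing follows the 256 MB guidance above.

# Migration sketch: stream vectors out of an existing store and
# batch-upsert into Turbopuffer. fetch_source_rows() is a
# hypothetical stand-in for your current database's export API.
import turbopuffer as tpuf

client = tpuf.Turbopuffer()
ns = client.namespace("migrated_docs")

BATCH_SIZE = 1000  # tune toward the 256 MB per-batch guidance

ids, vectors, titles = [], [], []
for row in fetch_source_rows():  # hypothetical source reader
    ids.append(row.id)
    vectors.append(row.vector)
    titles.append(row.title)
    if len(ids) >= BATCH_SIZE:
        ns.upsert(ids=ids, vectors=vectors, attributes={"title": titles})
        ids, vectors, titles = [], [], []

if ids:  # flush the final partial batch
    ns.upsert(ids=ids, vectors=vectors, attributes={"title": titles})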

What's the recommended approach for hybrid search combining vector and full-text search?

You can build hybrid search systems with Turbopuffer that natively support queries combining vector similarity with BM25 full-text search. Store your text content in attributes and use multi-query operations to combine vector search results with full-text search results. The system handles scoring and ranking across both search types automatically.
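
Reusing the multi-query shape shown earlier, a hybrid query might look like the sketch below. It assumes a list of sub-queries returns one result set per sub-query and that rows expose an id field, as in this article's other examples; the manual reciprocal rank fusion step is optional, shown for when you want explicit control over the final ranking.

# Hybrid search sketch: one vector sub-query plus one BM25
# full-text sub-query in a single call. query_embedding is
# assumed to come from your embedding model.
result_sets = ns.query([
    {'vector': query_embedding, 'top_k': 20},                      # semantic
    {'query': 'turbopuffer serverless architecture', 'top_k': 20}  # BM25
])

# Optional: fuse the two rankings yourself with reciprocal rank
# fusion (RRF), a standard technique for combining result lists.
def rrf(result_sets, k=60):
    scores = {}
    for results in result_sets:
        for rank, row in enumerate(results):
            scores[row.id] = scores.get(row.id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

fused = rrf(result_sets)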

How do I optimize for the lowest possible query latency?

Focus on cache warming strategies, efficient namespace organization (10 namespaces of 1M documents vs 1 namespace of 10M), and avoiding expensive filter patterns. Use shorter ID formats (integers vs UUID strings), batch writes to maintain good indexing thresholds, and consider geographic deployment closer to your users.

What monitoring and observability features are available for production deployments?

Turbopuffer provides built-in metrics for cache hit rates, query performance, and index statistics. BYOC deployments integrate with standard Kubernetes observability platforms. Monitor cache hit rates, query concurrency levels, and indexing lag as key performance indicators for your AI agent workflows.

How does namespace design affect both performance and costs?

Namespace design is critical for both performance and cost optimization when building TypeScript AI agent frameworks. Multiple smaller namespaces (10 namespaces of 1M documents) generally perform better than single large namespaces due to indexing efficiency and cache locality. However, cross-namespace queries require application-level coordination and may increase complexity in agentic workflows.

The Serverless Vector Database Advantage

Turbopuffer represents a fundamental shift in vector database architecture, proving that you don't have to choose between operational simplicity and cost efficiency. By separating storage from compute and implementing intelligent caching, it delivers the ease of managed services with economics that scale naturally with usage.

For teams building TypeScript AI agent frameworks, Turbopuffer delivers enterprise-grade performance and reliability without the enterprise-grade price tag or operational overhead. While the ecosystem remains smaller than established alternatives, the technical advantages and proven production success at companies like Cursor make it a serious contender for cost-conscious applications requiring reliable vector search capabilities in agentic workflows.

The serverless approach isn't just about reducing costs: it's about removing infrastructure decisions from the critical path of AI application development, letting teams focus on building great user experiences rather than managing database clusters.