November 20, 2025

Introduction to the pg_trgm PostgreSQL Extension

Daniel Lopes

The pg_trgm extension solves a fundamental problem: computers don't understand that "similar" doesn't mean "identical."

PostgreSQL's pg_trgm extension enables similarity-based text searching using trigram matching. A trigram is simply a group of three consecutive characters taken from a string. The extension measures how "similar" two strings are by counting their shared trigrams and provides indexed search capabilities that can handle typos, variations, and fuzzy matches without requiring external search engines.

Your users don't type perfectly. They make typos. They use abbreviations. They remember company names slightly wrong. Traditional exact matching fails them every time. pg_trgm has shipped with PostgreSQL's contrib modules for years and has been installable with a single CREATE EXTENSION since version 9.1. It's battle-tested, production-ready, and probably already available in your database. No external services. No additional infrastructure. No vendor lock-in.

How Trigrams Work

The concept is surprisingly elegant. Take any string and extract all possible three-character sequences: that's your trigrams.

Here's how PostgreSQL does it:

SELECT show_trgm('cat'); -- Result: {"  c"," ca","at ","cat"}

Notice those spaces? PostgreSQL adds two spaces at the beginning and one space at the end. This isn't arbitrary: according to the official PostgreSQL documentation, this padding captures word boundaries and makes the trigram algorithm more effective for real-world text by creating consistent boundary markers for analysis.

For a more complex example:

SELECT show_trgm('PostgreSQL'); -- Result: {"  p"," po","esq","gre","ost","pos","ql ","res","sql","stg","tgr"}

Once you have trigrams from two strings, similarity calculation becomes a straightforward set operation. The most common approach is the Jaccard similarity: count the shared trigrams and divide by the total number of unique trigrams in both strings. The result is a score between 0 (completely different) and 1 (identical).
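The mechanics are easy to reproduce outside the database. Below is a minimal Python sketch of the same idea, simplified from pg_trgm's C implementation and assuming plain ASCII input: lowercase each word, pad it with two leading spaces and one trailing space, slide a three-character window over it, then take the Jaccard ratio of the two trigram sets.

```python
def trigrams(s: str) -> set[str]:
    """pg_trgm-style trigrams: lowercase each word, pad with two
    leading and one trailing space, take every 3-char window."""
    grams = set()
    for word in s.lower().split():
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets:
    shared trigrams divided by total unique trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not (ta or tb):
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(sorted(trigrams('cat')))  # ['  c', ' ca', 'at ', 'cat']
print(round(similarity('PostgreSQL', 'Postgres'), 4))  # 0.6667
```

This is only a sketch (real pg_trgm also strips non-word characters and handles multibyte encodings), but it reproduces the shape of the calculation: 8 shared trigrams out of 12 unique ones across both strings.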

Why three characters? This isn't arbitrary. Academic research consistently shows that trigrams hit the sweet spot:

  • Bigrams (2 characters) create too many collisions
  • 4-grams suffer from data sparsity: many legitimate sequences never appear
  • Trigrams balance discrimination power with statistical reliability

According to a 2007 ISCA paper on language identification, trigram models (n=3) generally yield the best performance compared to other n-gram sizes. Kondrak's 2005 research confirms that trigrams achieve the practical sweet spot where sequences are long enough to be distinctive (avoiding bigram collision problems) but short enough to appear frequently (avoiding 4-gram sparsity problems).

Getting Started: Installation and Setup

The beauty of pg_trgm is its simplicity. If you're running PostgreSQL 9.1 or later (and you should be), you already have it:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

That's it. Seriously.

Now you have access to similarity functions and operators. But to make them fast, you'll want indexes.

Core Functions and Helpful Operators

PostgreSQL gives you several ways to work with trigram similarity:

The similarity() function:

SELECT similarity('PostgreSQL', 'Postgres'); -- Result: 0.6666667

The % operator for threshold-based matching:

-- Set your threshold (default is 0.3)
SET pg_trgm.similarity_threshold = 0.4;

SELECT 'PostgreSQL' % 'Postgres';
-- Result: true (the similarity, about 0.67, clears the 0.4 threshold)

The distance operator for ranking:

SELECT name, name <-> 'postgres' AS distance
FROM products
WHERE name % 'postgres'
ORDER BY distance
LIMIT 10;

The distance operator (<->) returns 1 minus the similarity score, so lower distances mean higher similarity. This is crucial for ranking search results by relevance.

There's also word_similarity() for substring matching, which is perfect for autocomplete and search-as-you-type applications:

SELECT word_similarity('word', 'two words'); -- Result: 0.8 (matches "word" in "two words")
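pg_trgm computes word_similarity() at trigram-extent granularity; the following Python sketch is only a rough approximation of that behavior (assuming simple whitespace word-splitting). It scores the query's trigrams against every contiguous run of words in the target and keeps the best ratio of shared trigrams to the query's own trigram count.

```python
def trigrams(s: str) -> set[str]:
    # pg_trgm-style trigrams: lowercase, pad each word with two
    # leading spaces and one trailing space, take 3-char windows
    grams = set()
    for word in s.lower().split():
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

def word_similarity_approx(query: str, text: str) -> float:
    """Best overlap between the query's trigrams and the trigrams of
    any contiguous run of words in text, relative to the query's own
    trigram count. A rough stand-in for pg_trgm's word_similarity()."""
    tq = trigrams(query)
    if not tq:
        return 0.0
    words = text.split()
    best = 0.0
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            extent = " ".join(words[i:j])
            best = max(best, len(tq & trigrams(extent)) / len(tq))
    return best

print(word_similarity_approx('word', 'two words'))  # 0.8
```

The winning extent here is "words": it shares 4 of the 5 trigrams of "word", giving 0.8.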

Performance Optimization: GIN vs GiST

Here's where most developers get it wrong. You have two index types for pg_trgm (GIN and GiST), and conventional wisdom about their performance is often misleading.

Everyone says: "Use GIN indexes, they're faster for searches."

Reality: it depends entirely on your workload, and GiST often performs better in practice. In one documented test against a 25 million row table, GiST queries completed in approximately 4 seconds while the same queries against a GIN index took approximately 90 seconds.

-- GIN index
CREATE INDEX products_name_gin ON products USING GIN (name gin_trgm_ops);

-- GiST index
CREATE INDEX products_name_gist ON products USING GIST (name gist_trgm_ops);

Here are the key differences:

GIN characteristics:

  • Generally faster for basic WHERE clause filtering
  • Does NOT support ORDER BY with distance operators
  • Slower to build and update
  • Better for relatively static data

GiST characteristics:

  • Supports ORDER BY with distance operators (essential for ranking)
  • Faster to build and maintain
  • Can outperform GIN on large datasets (real testing showed 4 seconds vs 90 seconds on 25 million rows)
  • Better for frequently updated data

The decision: If you need to rank results by similarity (and you probably do), use GiST. GIN indexes cannot accelerate the distance operator in ORDER BY, so ranked queries against them fall back to sorting every candidate row without index help.

You can also tune GiST performance with the siglen parameter:

CREATE INDEX products_name_gist ON products USING GIST (name gist_trgm_ops(siglen=128));

Higher siglen values improve accuracy but increase index size. According to pganalyze's guide, test values between 64 and 256 for most workloads; the default of 12 bytes is often suboptimal.

Real-World Applications

Typo-tolerant search: GitLab implemented pg_trgm across their platform to handle misspellings without external search engines. Users searching for "postgre" still find "PostgreSQL" results: the similarity operator (%), backed by GIN indexes, matches on trigram overlap rather than exact substrings.

Duplicate detection: Find potential duplicates with a simple self-join:

SELECT a.company_name, b.company_name,
       similarity(a.company_name, b.company_name) AS score
FROM companies a
JOIN companies b ON a.id < b.id
WHERE similarity(a.company_name, b.company_name) > 0.7
ORDER BY score DESC;

Autocomplete with fuzzy matching: Perfect for search-as-you-type where users make mistakes:

SELECT username
FROM users
WHERE username % 'jhn'       -- finds "john", "johan", etc.
ORDER BY username <-> 'jhn'
LIMIT 10;

LIKE query optimization: One of the biggest wins is making LIKE '%pattern%' queries fast. Standard B-tree indexes can't help with leading wildcards, but trigram indexes can:

-- This query can now use your trigram index
SELECT * FROM products WHERE name LIKE '%phone%';

Understanding the Limitations

pg_trgm isn't magic. It has clear boundaries you need to understand.

Performance breaks down predictably:

  • Network latency significantly impacts performance: according to Cybertec's study, 10ms network latency can reduce throughput by approximately 20x
  • PostgreSQL with pg_trgm works best in single-region deployments where latency stays under 5ms
  • Connection limits matter: according to production testing, performance degrades significantly beyond 700 concurrent connections, regardless of hardware
  • Use connection pooling (PgBouncer) if you need higher concurrency

Text length limitations:

  • Very short strings don't work well: pg_trgm needs meaningful trigram overlap
  • Strings shorter than 4 characters provide limited discriminative power, as trigrams extract three-character sequences and shorter strings don't generate sufficient overlap for reliable similarity calculations
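To see why short strings struggle, compare two three-letter strings that differ by a single character: with so few trigrams per string, one substitution wipes out almost all of the overlap. A quick sketch using the same simplified trigram logic as pg_trgm (assuming ASCII input):

```python
def trigrams(s: str) -> set[str]:
    # pg_trgm-style trigrams: lowercase, pad each word with two
    # leading and one trailing space, take every 3-char window
    grams = set()
    for word in s.lower().split():
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

a, b = trigrams('cat'), trigrams('bat')
print(a & b)                    # only {'at '} survives the typo
print(len(a & b) / len(a | b))  # 1/7 ~= 0.143, below the 0.3 default
```

A one-character typo in a three-character word drops the similarity to roughly 0.14, which falls under the default 0.3 threshold, so the match is silently filtered out.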

Semantic understanding:

  • It's not semantic search: pg_trgm measures character similarity, not meaning
  • "car" and "automobile" share zero trigrams despite being synonyms
  • For semantic search, you need full-text search or vector embeddings

When to Choose Alternatives

Consider external search engines like Elasticsearch when you need:

  • Multi-language search with linguistic analysis
  • Semantic understanding (synonyms, related concepts)
  • Sub-100ms response times at massive scale (>1TB)
  • Multi-region deployment with strict latency requirements
  • Write throughput exceeding single-node PostgreSQL capacity (typically >100K sustained writes/second)
  • Horizontal scaling without application-level sharding complexity

Consider distributed PostgreSQL solutions (CockroachDB, YugabyteDB) when you need:

  • Write throughput exceeding 100K sustained writes/second
  • Geographic distribution requirements
  • Horizontal scaling without application-level sharding

But for most web applications: pg_trgm provides 80% of the benefit with 20% of the complexity. Start here.

Getting Started: The Practical Path

The migration path is straightforward and low-risk:

  1. Enable the extension (zero risk): CREATE EXTENSION pg_trgm;
  2. Create indexes concurrently (minimal risk): CREATE INDEX CONCURRENTLY products_name_trgm ON products USING GIST (name gist_trgm_ops);
  3. Test queries in development: SET pg_trgm.similarity_threshold = 0.3; SELECT * FROM products WHERE name % 'search term';
  4. Benchmark your specific workload: Don't trust general advice. Test GIN vs GiST with your data and query patterns.