
The New Rules of Technical Debt: How AI Code Generation Changes Everything About Quality, Testing, and Speed

Leonardo Steffen

The pg_trgm extension solves a fundamental problem: computers don't understand that "similar" doesn't mean "identical."
PostgreSQL's pg_trgm extension enables similarity-based text searching using trigram matching. A trigram is simply a group of three consecutive characters taken from a string. The extension measures how "similar" two strings are by counting their shared trigrams and provides indexed search capabilities that can handle typos, variations, and fuzzy matches without requiring external search engines.
Your users don't type perfectly. They make typos. They use abbreviations. They remember company names slightly wrong. Traditional exact matching fails them every time. Pg_trgm has been part of PostgreSQL since version 9.1. It's battle-tested, production-ready, and probably already available in your database. No external services. No additional infrastructure. No vendor lock-in.
The concept is surprisingly elegant. Take any string and extract all possible three-character sequences: that's your trigrams.
Here's how PostgreSQL does it:
SELECT show_trgm('cat');
-- Result: {" c"," ca","at ","cat"}Notice those spaces? PostgreSQL adds two spaces at the beginning and one space at the end. This isn't arbitrary: according to the official PostgreSQL documentation, this padding captures word boundaries and makes the trigram algorithm more effective for real-world text by creating consistent boundary markers for analysis.
For a more complex example:
SELECT show_trgm('PostgreSQL');
-- Result: {" p"," po","ost","gre","esg","sql","pos","tre","esq","stg","tgr","ql ","pon","ont","ntg"}Once you have trigrams from two strings, similarity calculation becomes a straightforward set operation. The most common approach is the Jaccard similarity: count the shared trigrams and divide by the total number of unique trigrams in both strings. The result is a score between 0 (completely different) and 1 (identical).
Why three characters? This isn't arbitrary. Academic research consistently shows that trigrams hit the sweet spot:
According to a 2007 ISCA paper on language identification, trigram models (n=3) generally yield the best performance compared to other n-gram sizes. Kondrak's 2005 research confirms that trigrams achieve the practical sweet spot where sequences are long enough to be distinctive (avoiding bigram collision problems) but short enough to appear frequently (avoiding 4-gram sparsity problems).
The beauty of pg_trgm is its simplicity. If you're running PostgreSQL 9.1 or later (and you should be), you already have it:
CREATE EXTENSION IF NOT EXISTS pg_trgm;That's it. Seriously.
Now you have access to similarity functions and operators. But to make them fast, you'll want indexes.
PostgreSQL gives you several ways to work with trigram similarity:
The similarity() function:
SELECT similarity('PostgreSQL', 'Postgres');
-- Result: 0.5454545The % operator for threshold-based matching:
-- Set your threshold (default is 0.3)
SET pg_trgm.similarity_threshold = 0.4;
SELECT 'PostgreSQL' % 'Postgres';
-- Result: true (because similarity > 0.3, the default threshold)The distance operator for ranking:
SELECT name, name <-> 'postgres' AS distance
FROM products
WHERE name % 'postgres'
ORDER BY distance
LIMIT 10;The distance operator (<->) returns 1 minus the similarity score, so lower distances mean higher similarity. This is crucial for ranking search results by relevance.
There's also word_similarity() for substring matching, which is perfect for autocomplete and search-as-you-type applications:
SELECT word_similarity('word', 'two words');
-- Result: 0.8 (matches "word" in "two words")Here's where most developers get it wrong. You have two index types for pg_trgm (GIN and GiST), and conventional wisdom about their performance is often misleading.
Everyone says: "Use GIN indexes, they're faster for searches." However, this conventional wisdom doesn't always hold in practice.
Reality: It depends entirely on your workload, and GiST often performs better in practice. According to real-world testing, GiST indexes can actually outperform GIN indexes significantly: one documented case showed GiST queries completing in approximately 4 seconds versus GIN taking approximately 90 seconds on a 25 million row table.
-- GIN index
CREATE INDEX products_name_gin ON products
USING GIN (name gin_trgm_ops);
-- GiST index
CREATE INDEX products_name_gist ON products
USING GIST (name gist_trgm_ops);Here are the key differences:
GIN characteristics:
GiST characteristics:
The decision: If you need to rank results by similarity (and you probably do), GiST is required. GIN indexes simply don't support the distance operators needed for ORDER BY ranking.
You can also tune GIST performance with the siglen parameter:
CREATE INDEX products_name_gist ON products
USING GIST (name gist_trgm_ops(siglen=128));Higher siglen values improve accuracy but increase index size. According to pganalyze's guide, test values between 64-256 for most workloads, with the default of 12 bytes often being suboptimal.
Typo-tolerant search: GitLab implemented pg_trgm across their platform to handle misspellings without external search engines. Users searching for "postgre" find "PostgreSQL" results using the similarity operator (%) and GIN indexes, which measure text similarity based on trigram matching rather than exact substring matching.
Duplicate detection: Find potential duplicates with a simple self-join:
SELECT
a.company_name,
b.company_name,
similarity(a.company_name, b.company_name) AS score
FROM companies a
JOIN companies b ON a.id < b.id
WHERE similarity(a.company_name, b.company_name) > 0.7
ORDER BY score DESC;Autocomplete with fuzzy matching: Perfect for search-as-you-type where users make mistakes:
SELECT username
FROM users
WHERE username % 'jhn' -- Finds "john", "johan", etc.
ORDER BY username <-> 'jhn'
LIMIT 10;LIKE query optimization: One of the biggest wins is making LIKE '%pattern%' queries fast. Standard B-tree indexes can't help with leading wildcards, but trigram indexes can:
-- This query now uses your trigram index
SELECT * FROM products WHERE name LIKE '%phone%';pg_trgm isn't magic. It has clear boundaries you need to understand.
Performance breaks down predictably:
Text length limitations:
Semantic understanding:
Consider external search engines like Elasticsearch when you need:
Consider distributed PostgreSQL solutions (CockroachDB, YugabyteDB) when you need:
But for most web applications: pg_trgm provides 80% of the benefit with 20% of the complexity. Start here.
The migration path is straightforward and low-risk:
CREATE EXTENSION pg_trgm;3. **Test queries in development**:```sql
SET pg_trgm.similarity_threshold = 0.3;
SELECT * FROM products WHERE name % 'search term';

Sergey Kaplich