Aapeli Vuorinen

Postgres Text Search: Full Text vs Trigram Search

Postgres (or PostgreSQL, if you like) is an amazingly powerful database engine. I’ve recently been working hard on Couchers.org, and as part of that, I’ve come to really dig into Postgres and appreciate its incredible value. For our use case in an open-source project built by a handful of passionate volunteers, it’s very useful not to have to pull in additional dependencies and complicate the architecture. To this end, we’ve found Postgres to be a really great fit: it packs so much functionality right into a stable and well-supported package.

As part of implementing text search for Couchers, I’ve spent some time learning about the different ways of going about this using Postgres, so I jotted down some of the basic ideas.

Postgres has two types of text search: full text search, and trigram-based “fuzzy” search. What makes text special is the many idiosyncrasies of our natural language. In most SQL databases, you like to do very straightforward relational queries, such as SELECT user_id, display_name FROM users WHERE username = 'aapeli', but text is not so easy. There’s a variety of differences that make the problem quite tricky. Firstly, a word does not always come in the same form, for example Examples! has roughly the same semantic meaning as example (we also have synonyms, but these are much harder). Also, we’re all humans and we make mistakes, so you will often see spelling errors, and it’s nice to have a search engine that is somewhat resilient to them. For very basic needs, one can get away with LIKE and ILIKE, as we did, but these quickly start limiting the usefulness of search when users really want to find topics of interest. (Note that the trigram module is able to speed up these queries as well through its index.)
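For illustration, here’s roughly what that kind of basic search looks like, as a sketch against a hypothetical users table: the % wildcards make it a substring match and ILIKE makes it case-insensitive, but there’s no stemming and no typo tolerance.

# select user_id, display_name
    from users
   where username ilike '%aapeli%'
      or display_name ilike '%aapeli%';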

This is not a full overview of the power of these tools; rather, it’s a quick teaser and introduction to how they work! I encourage you to go read the full Postgres docs if you find this fascinating. Most of the content here comes from reading those pages, playing around with these features, and implementing search for Couchers.

Postgres full text search works by normalizing words, then matching them up exactly. That is, it tries to get around the messiness of the language by trying to understand it and turning all words with the same semantic meaning into one symbol. These symbols are called lexemes in Postgres. (Each original word may result in multiple lexemes at the end.)

Here’s an example using the default configuration:

# select to_tsvector('This is an example sentence.');
      to_tsvector
------------------------
 'exampl':4 'sentenc':5
(1 row)

Stop words (short words that aren’t useful for search, e.g. ‘this’, ‘is’, ‘an’) are eliminated, punctuation is discarded, and the rest of the words are stemmed, so ‘example’ maps to the lexeme ‘exampl’, and so on. Notice that each lexeme is accompanied by positional information.
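The stemming is what lets differently inflected forms of a word match each other. For example, both ‘example’ and ‘examples’ normalize to the same lexeme (this is with the English configuration, so your output may differ if your default is configured otherwise):

# select to_tsvector('english', 'example examples');
 to_tsvector
--------------
 'exampl':1,2
(1 row)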

This allows indexing large documents and matching up words exactly. To query against this representation, we convert a query string into a tsquery:

# select websearch_to_tsquery('examples of sentences');
 websearch_to_tsquery
----------------------
 'exampl' & 'sentenc'
(1 row)

That is, we’re looking for documents whose normalized forms contain both exampl and sentenc (the & signifies logical “and”). Indeed, we get a match:

# select to_tsvector('This is an example sentence.') @@ websearch_to_tsquery('examples of sentences');
 ?column?
----------
 t
(1 row)
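To make this fast over a whole table, the usual pattern is to keep the tsvector in its own generated column and index it with GIN. Here’s a minimal sketch, assuming a hypothetical pages table with title and content columns (generated columns need Postgres 12 or newer):

# alter table pages add column tsv tsvector
    generated always as (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''))) stored;
# create index pages_tsv_idx on pages using gin (tsv);
# select page_id, title from pages where tsv @@ websearch_to_tsquery('examples of sentences');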

Note that full text search is very language dependent due to the use of language-specific dictionaries. There are a variety of dictionaries available, and Postgres is additionally able to use different standard formats of dictionaries.
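You can see this dependence by running the same sentence through the built-in ‘simple’ configuration, which only lowercases words and does no stop word removal or stemming:

# select to_tsvector('simple', 'This is an example sentence.');
                   to_tsvector
--------------------------------------------------
 'an':3 'example':4 'is':2 'sentence':5 'this':1
(1 row)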

In Couchers, we’re currently using English. It’s therefore not good for words that aren’t “Englishy” and aren’t in the dictionary, for example local place names. This causes some headaches for us.
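For example, a partial or misspelled place name simply won’t match, because the lexemes have to line up exactly:

# select to_tsvector('Joensuu, Finland') @@ websearch_to_tsquery('joens');
 ?column?
----------
 f
(1 row)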

There is a lot more Postgres full text search can do, including weighting different parts of a document (e.g. title, address, content) differently, ranking results with a variety of algorithms, and automatically creating snippets highlighting the matched portions of text. Read more about these on the Postgres docs.
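As a small taste of the snippet feature, ts_headline highlights the matched words in a fragment of the document; with the default settings it should produce something like this (matches are wrapped in <b> tags by default):

# select ts_headline('This is an example sentence.', websearch_to_tsquery('examples'));
             ts_headline
--------------------------------------
 This is an <b>example</b> sentence.
(1 row)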

Implementation details in Couchers

We use the Postgres A/B/C/D weight classes for ranking, where each searchable type has its own set of A/B/C/D fields. A is normally the title of the entity, e.g. the name of the community, the name of the user (or their username), the page title, etc. B is normally the address or the city the user typed in. C is any other primary text that’s more important than the main content, and D is all other text: the main content and so forth.
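A sketch of what this looks like, with hypothetical table and column names: setweight tags each part of the document with a class, and ts_rank then scores matches in A-weighted parts higher than matches in D-weighted parts.

select name,
       ts_rank(
         setweight(to_tsvector('english', coalesce(name, '')), 'A')
         || setweight(to_tsvector('english', coalesce(address, '')), 'B')
         || setweight(to_tsvector('english', coalesce(description, '')), 'D'),
         websearch_to_tsquery('community garden')
       ) as rank
  from communities
 order by rank desc;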

We use websearch_to_tsquery to turn user input into a tsquery. It’s safe to use with user-supplied input, and it supports the usual web-searchy operators: quoted text for phrases, “or” for logical or, and a dash for negation.
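For instance (these examples are close to the ones in the Postgres docs, and assume the default English configuration):

# select websearch_to_tsquery('"supernovae stars" -crab');
       websearch_to_tsquery
----------------------------------
 'supernova' <-> 'star' & !'crab'
(1 row)

# select websearch_to_tsquery('"sad cat" or "fat rat"');
        websearch_to_tsquery
-----------------------------------
 'sad' <-> 'cat' | 'fat' <-> 'rat'
(1 row)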

Trigram search works by breaking up words into consecutive triplets of characters, then looking these up in an index (though indexing is of course optional). With the right indexes, this scales to insane sizes and can even be used to implement indexed regular expressions over huge sets of text. Postgres supports this straight out of the box!

The basic search, however, just needs to count overlapping trigrams.

To get started you need to install the pg_trgm extension which comes pre-packaged with Postgres:

# CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION

You can then get matching! Here’s an example sentence decomposed into trigrams:

# select show_trgm('This is an example sentence.');
                                                                 show_trgm
-------------------------------------------------------------------------------------------------------------------------------------------
 {"  a","  e","  i","  s","  t"," an"," ex"," is"," se"," th",amp,"an ","ce ",enc,ent,exa,his,"is ","le ",mpl,nce,nte,ple,sen,ten,thi,xam}
(1 row)

A search query with a spelling error:

# select show_trgm('example sentnences');
                                         show_trgm
-------------------------------------------------------------------------------------------
 {"  e","  s"," ex"," se",apl,ces,enc,ent,"es ",exm,"le ",map,nce,nen,ntn,ple,sen,tne,xma}
(1 row)

Now we can match against it:

# select word_similarity('example sentnences', 'This is an example sentence.');
 word_similarity
-----------------
       0.6666667
(1 row)

Note that full text search would not match due to the spelling mistake:

# select to_tsvector('This is an example sentence.') @@ websearch_to_tsquery('example sentnences');
 ?column?
----------
 f
(1 row)

It’s also good on place names:

# select word_similarity('joens', 'Joensuu, Finland');
 word_similarity
-----------------
       0.8333333
(1 row)

Trigram search is much better for “fuzzy” matching, but it doesn’t understand text semantics at all, so it’s not ideal for proper full text search or when trying to locate topics in text, and it’s a bit trickier to rank large documents. It can also be a bit tricky to get the indexes right: if a search query results in no trigrams, Postgres falls back to a full table scan… ouch!
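To actually use an index, you create a GIN (or GiST) index with the gin_trgm_ops operator class and query with the trigram operators. A minimal sketch, again with a hypothetical users table:

# create index users_username_trgm_idx on users using gin (username gin_trgm_ops);
# -- the % operator matches when similarity() exceeds pg_trgm.similarity_threshold (0.3 by default)
# select user_id, username
    from users
   where username % 'aapelli'
   order by similarity(username, 'aapelli') desc;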

However, it should be very good for searching short titles/addresses, and in Couchers we use it against the “A” class from above. That means we do this kind of trigram matching against usernames, city names, users’ display names, and so on: things that don’t generally carry a lot of “semantic meaning” and where people easily make typos! Read more about pg_trgm on the Postgres docs.

Implementation details in Couchers

We use unaccent first on both the query and the text being searched to remove accents and normalize the text before trigram matching (which discards non-alphanumeric characters anyway). We match against the A list only. We weight trigram similarity a bit higher than the full text search results.
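unaccent is another small extension that ships with Postgres; for reference, here’s what it does on its own (we then feed both sides of word_similarity through it):

# CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE EXTENSION
# select unaccent('São Tomé');
 unaccent
----------
 Sao Tome
(1 row)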

Further reading

To read more about these topics, head over to the well written Postgres docs on Full text search and Trigram matching with the pg_trgm module.