Before diving into how to use PostgreSQL and Arel, I’d like to underline that you’ll be able to apply what you read here to projects that don’t even use Arel or Rails. The main features are provided by PostgreSQL itself, so you can use them from PHP, Elixir or whatever language / framework you’re using.
What is full text search?
Full Text Search provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query.
In other words, it finds the documents containing your query terms and returns them ordered by relevance.
It is more evolved than simply matching a regular expression, for example. With a regex it would be hard to match derived words, ignore meaningless words or sort results by relevance.
Full text search works by pre-processing documents and indexing them efficiently to allow fast searches afterward. Pre-processing consists of parsing each document to split it into meaningful tokens, then normalizing and storing them.
Avoiding useless dependencies
For a long time I used to add ElasticSearch to my stack as soon as I had the most basic need for full text search. ElasticSearch is really an awesome piece of software. It’s freaking powerful and blazing fast. If you’ve never used it you can’t even imagine. There’s so much to say about it that I could write dozens of posts about it.
That said, I’m now trying to avoid adding libs and dependencies to my apps whenever I can. Implementing basic to average full text search can be achieved without using ElasticSearch.
So can I rely on PostgreSQL to cover my basic to average needs when it comes to full text search? Yes! PostgreSQL provides the tools to do the most common things. It has dictionaries you can choose from to eliminate words that should not be considered in a search (stop words). It gives the ability to choose the parser used to split your document into tokens and how to normalize them.
Tokenizing a document
PostgreSQL provides two main functions: one to tokenize a document, i.e. remove stop words, stem the remaining words (reduce them to their root form) and record their positions, and one to build the query to search it with.

These functions are to_tsvector and to_tsquery. Be aware that PostgreSQL will use your current locale by default to detect stop words and normalize tokens. You can provide the desired language as the first argument. Let’s see how it works:
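The original example isn’t shown here, so here is a minimal illustration with a made-up sentence (the sample text is an assumption, not the author’s original):

```sql
-- Tokenize a sample sentence using the english dictionary
SELECT to_tsvector('english', 'You feared the dark in a forest and the night');

-- Result:
-- 'dark':4 'fear':2 'forest':7 'night':10
```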
As you can see, useful words were normalized and their positions in the document were indexed. “you”, “in”, “a”, “and” and “the” were removed since they are not valuable in a search.
Searching for a match
PostgreSQL provides the to_tsquery function to query a tokenized document. Here’s how to use it:
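A sketch of such a query, matching a tsvector against a tsquery with the @@ operator (the sample sentence is an assumption):

```sql
-- Does the document contain something matching "feared"?
SELECT to_tsvector('english', 'You feared the dark in a forest and the night')
       @@ to_tsquery('english', 'feared');
-- t
```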
This is the simplest query you can make. We’re searching for the word “feared” in the document, or something similar. In this example it does match and the query returns true.
Since it’s a full text search engine we can also do a boolean search:
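For instance (again with an assumed sample sentence), the & and ! operators combine terms:

```sql
-- "dark" AND "night" both appear: match
SELECT to_tsvector('english', 'You feared the dark in a forest and the night')
       @@ to_tsquery('english', 'dark & night');
-- t

-- "dark" AND NOT "forest": no match, since the document contains "forest"
SELECT to_tsvector('english', 'You feared the dark in a forest and the night')
       @@ to_tsquery('english', 'dark & !forest');
-- f
```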
That’s nice, it seems like we have a real full text search engine right in our PostgreSQL. But how do we search in table and column data?
Searching in tables
If you’re reading this I bet that your goal is not to ask your PostgreSQL to search for text in static content you provide to it.
Let’s see how we can do full text search in the data of our tables. In this example we’ll pretend that we have a posts table and that we want to match every record that has a given string (or something similar) in its title or body:
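A sketch of such a query, assuming the posts table has title and body text columns:

```sql
SELECT *
FROM posts
WHERE to_tsvector('english', title) @@ to_tsquery('english', 'music')
   OR to_tsvector('english', body) @@ to_tsquery('english', 'music');
```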
It would return all the matching records whose body or title contains “musician”, “music” or something with the same root.
The querying function lets you build boolean expressions, so you can use AND (&), OR (|) and negation (!), and you can even add a weight (from A to D) on a word of the query. You can read more about that in the Parsing Queries section of the documentation.
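As an illustration of weights (the setweight call and sample text are my own assumptions), a weight label in the query restricts matches to lexemes stored with that weight in the tsvector:

```sql
-- Lexemes stored with weight A match a query restricted to weight A
SELECT setweight(to_tsvector('english', 'PostgreSQL full text search'), 'A')
       @@ to_tsquery('english', 'search:A');
-- t

-- ...but not a query restricted to weight B
SELECT setweight(to_tsvector('english', 'PostgreSQL full text search'), 'A')
       @@ to_tsquery('english', 'search:B');
-- f
```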
You can do much more like ranking the results and highlighting matches but that’s not the topic of this post.
What about the performance?
That’s all nice, and I’m trying to sell you PostgreSQL’s full text search capabilities, but is it fast enough?
To be fair — by default — it’s not blazing fast, especially when you add more terms to your query: each term adds latency to your query time. But that’s SQL, and most of the time the solution is to add an index. You can drastically speed up the querying time by adding indexes on the columns used in your query.
To add an index on our previous columns we’ll do:
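A sketch of the expression indexes (the index names are assumptions); note that the indexed expression must exactly match the one used in your WHERE clause for the index to be used:

```sql
CREATE INDEX index_posts_on_title ON posts
  USING gin (to_tsvector('english', title));

CREATE INDEX index_posts_on_body ON posts
  USING gin (to_tsvector('english', body));
```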
By adding these indexes you’ll reduce the query time considerably — by a factor of something like 2000! A query that was taking 2 seconds on a large table will drop to something like 1 ms. The more terms your query has, the more you’ll gain.
Using full text search through Arel
Arel doesn’t come with built-in support for creating full text search related queries.
That’s not a big deal since it’s pretty simple to create our own functions, nodes and operations in Arel.
We’ll need a way to tokenize our documents by using to_tsvector, and also a way to query the tokenized documents by using to_tsquery.
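A minimal sketch of what those helpers could look like, using Arel’s NamedFunction and InfixOperation nodes (the helper names and the posts table are assumptions, not the author’s final code):

```ruby
require 'arel'

posts = Arel::Table.new(:posts)

# Hypothetical helper: builds to_tsvector('english', <column>)
def tokenize(column)
  Arel::Nodes::NamedFunction.new(
    'to_tsvector',
    [Arel::Nodes.build_quoted('english'), column]
  )
end

# Hypothetical helper: builds to_tsquery('english', '<terms>')
def query(terms)
  Arel::Nodes::NamedFunction.new(
    'to_tsquery',
    [Arel::Nodes.build_quoted('english'), Arel::Nodes.build_quoted(terms)]
  )
end

# tsvector @@ tsquery, expressed as an Arel infix operation
matches = Arel::Nodes::InfixOperation.new('@@', tokenize(posts[:body]), query('music'))

posts.project(Arel.star).where(matches).to_sql
# => roughly: SELECT * FROM "posts"
#    WHERE to_tsvector('english', "posts"."body") @@ to_tsquery('english', 'music')
```

From here you can wrap these nodes in scopes or query objects so the rest of your app never has to build them by hand.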