2

I'm searching several million names and addresses in a Postgres table. I'd like to use pg_trgm to do fast fuzzy search.

My app is actually very similar to the one in Optimizing a postgres similarity query (pg_trgm + gin index), and the answer there is pretty good.

My problem is that the relevance ranking isn't very good. There are two issues:

  1. I want names to get a heavier weight in the ranking than addresses, and it's not clear how to do that and still get good performance. For example, if a user searches for 'smith', I want 'Bob Smith' to appear higher in the results than '123 Smith Street'.

  2. The current results are biased toward columns that contain fewer characters. For example, a search for 'bob' will rank 'Bobby Smith' (without an address) above 'Bob Smith, 123 Bob Street, Smithville Illinois, 12345 with some other info here'. The reason for this is that the similarity score penalizes for parts of the string that do not match the search terms.

I'm thinking that I'll get a much better result if I could get a score that simply returns the number of matched trigrams in a record, not the number of trigrams scaled by the length of the target string. That's the way most search engines (like Elastic) work -- they rank by the weighted number of hits and do not penalize long documents.

Is it possible to do this with pg_trgm AND get good (sub-second) performance? I could do an arbitrary ranking of results, but if the ORDER BY clause does not match the index, then performance will be poor.

ccleve
  • 15,239
  • 27
  • 91
  • 157
  • 1
    I think there are only the `%`, `<%` and `<<%` operators (and the respective distance functions) that can be supported by an index. – Laurenz Albe Feb 17 '21 at 08:24

1 Answers1

0

I know that this is an old question but this might be useful for others.

If the text you want to search falls in ascii table (characters in the range of [a-zA-Z0-9] and some other symbols), then you probably want to use Full Text Search feature (read official document Full Text Search).

because not only it gives you ability to sort by relevancy, but also ability to customize things like text steming (using snowball algorithm), which maps words like connection, connections, connective, connected, and connecting to connect (Read more about Snowball Stem). This makes your application performs better on search.

But if your requirement is to search text out of range of ascii table, like unicode, which is common if you try to support Asian languages like Japanese, Thai, Korean, etc. then using pg_trgm is perfectly fine.

To do the search that is not biased to the shorter text as mentioned in the question, you could use word_similarity(), instead of similarity().

As per the official documentation:

word_similarity( text, text ) Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below.

So for example:

postgres=# SELECT word_similarity('white cat', 'white dog and black cat') as "similarity 1", word_similarity('white cat', 'I have a white dog and a black cat') as "similarity 2", word_similarity('white cat', 'I have a lovely white dog and a cute big black cat in a house') as "similarity 3";
 similarity 1 | similarity 2 | similarity 3
--------------+--------------+--------------
          0.6 |          0.6 |          0.6
(1 row)

As shown above, they all have equal scores.

And when you want to use it in a query:

SELECT col, word_similarity('some query', col) from my_table where col <% 'some query';

According to the document:

text <% text → boolean Returns true if the similarity between the trigram set in the first argument and a continuous extent of an ordered trigram set in the second argument is greater than the current word similarity threshold set by pg_trgm.word_similarity_threshold parameter.

For somehting more complicated like calculating hit scores, relevance weight/boost, and faster response time on larger dataset, you should use Elastic instead, but keep in mind that Elastic instance needs at least 2GB of ram and more, so you need dedicated EC2 instance(s) for that purpose. But for small-medium app, pg_trgm works just fine while saving your server cost.

Hope you find this helpful.

Pakpoom Tiwakornkit
  • 2,601
  • 1
  • 20
  • 19