
I have defined the following index:

CREATE INDEX
    users_search_idx
ON
    auth_user
USING
    gin(
        username gin_trgm_ops,
        first_name gin_trgm_ops,
        last_name gin_trgm_ops
    );
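
For completeness, the gin_trgm_ops operator class comes from the pg_trgm extension, which has to be installed first:

CREATE EXTENSION IF NOT EXISTS pg_trgm;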

I am performing the following query:

PREPARE user_search (TEXT, INT) AS
    SELECT
        username,
        email,
        first_name,
        last_name,
        ( -- would probably do per-field weightings here
            s_username + s_first_name + s_last_name
        ) rank
    FROM
        auth_user,
        similarity(username, $1) s_username,
        similarity(first_name, $1) s_first_name,
        similarity(last_name, $1) s_last_name
    WHERE
        username % $1 OR
        first_name % $1 OR
        last_name % $1
    ORDER BY
        rank DESC
    LIMIT $2;

The auth_user table has 6.2 million rows.

The speed of the query seems to depend heavily on the number of rows that potentially match the similarity predicates.

Increasing the similarity threshold via set_limit() helps, but it reduces the usefulness of the results by eliminating partial matches.
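
For reference, set_limit() adjusts the pg_trgm similarity threshold used by the % operator (0.3 by default):

SELECT show_limit();   -- current threshold, 0.3 by default
SELECT set_limit(0.5); -- stricter: fewer rows pass the % operator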

Some searches return in 200ms, others take ~ 10 seconds.

We have an existing implementation of this feature using Elasticsearch that returns in < 200ms for any query, while doing more complicated (better) ranking.

Is there any way we could improve this to get more consistent performance?

It's my understanding that a GIN index (an inverted index) is the same basic approach used by Elasticsearch, so I would have thought some optimisation is possible.

An EXPLAIN ANALYZE EXECUTE user_search('mel', 20) shows:

Limit  (cost=54099.81..54099.86 rows=20 width=52) (actual time=10302.092..10302.104 rows=20 loops=1)
  ->  Sort  (cost=54099.81..54146.66 rows=18739 width=52) (actual time=10302.091..10302.095 rows=20 loops=1)
        Sort Key: (((s_username.s_username + s_first_name.s_first_name) + s_last_name.s_last_name)) DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Nested Loop  (cost=382.74..53601.17 rows=18739 width=52) (actual time=118.164..10293.765 rows=8380 loops=1)
              ->  Nested Loop  (cost=382.74..53132.69 rows=18739 width=56) (actual time=118.150..10262.804 rows=8380 loops=1)
                    ->  Nested Loop  (cost=382.74..52757.91 rows=18739 width=52) (actual time=118.142..10233.990 rows=8380 loops=1)
                           ->  Bitmap Heap Scan on auth_user  (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
                                 Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
                                Rows Removed by Index Recheck: 2434523
                                Heap Blocks: exact=49337 lossy=53104
                                ->  BitmapOr  (cost=382.74..382.74 rows=18757 width=0) (actual time=107.436..107.436 rows=0 loops=1)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=40.200..40.200 rows=88908 loops=1)
                                            Index Cond: ((username)::text % 'mel'::text)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=43.847..43.847 rows=102028 loops=1)
                                            Index Cond: ((first_name)::text % 'mel'::text)
                                      ->  Bitmap Index Scan on users_search_idx  (cost=0.00..122.89 rows=6252 width=0) (actual time=23.387..23.387 rows=58740 loops=1)
                                            Index Cond: ((last_name)::text % 'mel'::text)
                          ->  Function Scan on similarity s_username  (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=8380)
                    ->  Function Scan on similarity s_first_name  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
              ->  Function Scan on similarity s_last_name  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
Execution time: 10302.559 ms

The server is Postgres 9.6.1 running on Amazon RDS.

update

1.

Shortly after posting the question I found this info: https://www.postgresql.org/message-id/464F3C5D.2000700@enterprisedb.com

So I tried

-> SHOW work_mem;
4MB
-> SET work_mem='12MB';
-> EXECUTE user_search('mel', 20);
(results returned in ~1.5s)

This made a big improvement (previously > 10s)!

1.5s is still much slower than ES for a similar query, so I would still like to hear any suggestions for optimising the query.
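
(A side note, assuming the app connects as a dedicated role: the setting doesn't have to be issued per session; it can be attached to the role. The role name below is a placeholder.)

-- app_user is a hypothetical role name
ALTER ROLE app_user SET work_mem = '12MB';

On RDS the same can also be set globally via the parameter group.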

2.

In response to comments, and after seeing this question (Postgresql GIN index slower than GIST for pg_trgm), I tried exactly the same setup with a GiST index in place of the GIN one.
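
For reference, the GiST equivalent looks like this (the index name is arbitrary):

CREATE INDEX users_search_gist_idx ON auth_user
USING gist (
    username gist_trgm_ops,
    first_name gist_trgm_ops,
    last_name gist_trgm_ops
);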

Trying the same search above, it returned in ~3.5s, using default work_mem='4MB'. Increasing work_mem made no difference.

From this I conclude that the GiST index is more memory-efficient (it did not hit a pathological case like GIN did) but is slower than GIN when GIN is working properly. This is in line with the docs, which recommend the GIN index.

3.

I still don't understand why so much time is spent in:

 ->  Bitmap Heap Scan on auth_user  (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380 loops=1)
     Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text % 'mel'::text))
     Rows Removed by Index Recheck: 2434523
     Heap Blocks: exact=49337 lossy=53104

I don't understand why this step is needed or what it's doing.

There are three Bitmap Index Scans beneath it, one for each of the % $1 clauses; their results then get combined in a BitmapOr step. These parts are all quite fast.

But even in the case where we don't run out of work_mem, we still spend nearly a whole second in the Bitmap Heap Scan.
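
(One way to dig further, added here as a suggestion rather than something run against the original data, is to include buffer statistics, which show how much of the heap actually has to be visited:)

EXPLAIN (ANALYZE, BUFFERS) EXECUTE user_search('mel', 20);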

  • So... the huge majority of the time is spent in costly `Bitmap Heap Scan on auth_user Recheck Cond`. For some reason this has to remove 2.4 million rows, although the inner Index Scans return < 300k rows total. I don't understand what is happening here – Anentropic May 09 '17 at 11:35
  • Why are you doing the similarity calls in the FROM clause? You don't appear to be doing anything with them apart from in the SELECT clause. Note that the total time used there is the "actual time" * "loops" although I wouldn't put too much faith in the figure for "actual time" - it's too small to be a reliable figure. – Richard Huxton May 09 '17 at 11:53
  • similarity in the FROM is something I got from an example query somewhere... it makes it more readable. I tried moving into the SELECT portion but no difference in performance. The actual time on this query is 10s... for some searches I am getting the worst case exactly as reported in the EXPLAIN above – Anentropic May 09 '17 at 12:04
  • I have actually just found something that makes a big difference ... `SET work_mem='8MB'` (from 4MB) cuts the worst case down to ~1.3s. So I guess the main problem is as described here https://www.postgresql.org/message-id/464F3C5D.2000700@enterprisedb.com – Anentropic May 09 '17 at 12:06
  • 1.3s is not really awesome but it's much better than 10s – Anentropic May 09 '17 at 12:06
  • Have you tried the `gist` index method (instead of `gin`)? -- Is a [*word-similarity*](https://www.postgresql.org/docs/current/static/pgtrgm.html) over the `CONCAT(username, ' ', first_name, ' ', last_name)` expression acceptable for you? (It works a little bit different from what you use, but it should be searched/ordered faster than your current index). – pozs May 09 '17 at 12:15
  • @pozs I did briefly experiment with word similarity operators but they seemed even slower. I have created a GIST index with identical spec but so far the query planner does not choose it. I may try dropping the GIN index to compare – Anentropic May 09 '17 at 13:37
  • @Anentropic yes, having both might confuse the planner -- a single, expression-based index would eliminate the need for `BitmapOr` (under `Bitmap Heap Scan`). – pozs May 09 '17 at 13:43
  • the `Bitmap Heap Scan` is by far the slowest part of the query though – Anentropic May 09 '17 at 13:53
  • same queries using identical GIST index: worst case time ~3.7s (work_mem=4MB) ...no improvement with work_mem=8MB – Anentropic May 09 '17 at 14:32
  • Great question, I would very much like to upvote it but my policy prohibits me from upvoting open but already answered questions. – raratiru Aug 17 '18 at 21:40
  • @raratiru I was hoping for other answers. Are you saying you think I should have accepted the first one received? – Anentropic Aug 18 '18 at 08:22
  • No not at all, apologies for not reading the comments in the answer! I am currently digging into this issue of trying to optimize such a query. – raratiru Aug 18 '18 at 11:32

1 Answer


I expect much faster results with this approach:

1.

Create a GiST index with 1 column holding concatenated values:

CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);

This assumes all 3 columns are defined NOT NULL (you did not specify); otherwise you need to do more.
Why not simplify with concat_ws()?
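
If the columns can be NULL, a null-safe sketch might look like this (not tested here; note that concat_ws() is only STABLE, not IMMUTABLE, so it cannot be used directly in an index expression):

CREATE INDEX users_search_idx ON auth_user
USING gist ((coalesce(username, '') || ' ' || coalesce(first_name, '')
          || ' ' || coalesce(last_name, '')) gist_trgm_ops);

The WHERE and ORDER BY expressions in the query below would then have to use the identical coalesce() expression.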

2.

Use a proper query, matching the above index:

SELECT username, email, first_name, last_name
     , similarity(username  , $1) AS s_username
     , similarity(first_name, $1) AS s_first_name
     , similarity(last_name , $1) AS s_last_name
     , row_number() OVER () AS rank  -- greatest similarity first
FROM   auth_user
WHERE     (username || ' ' || first_name || ' ' || last_name) %   $1  -- !!
ORDER  BY (username || ' ' || first_name || ' ' || last_name) <-> $1  -- !!
LIMIT  $2;

Expressions in WHERE and ORDER BY must match the index expression!

In particular ORDER BY rank (like you had it) will always perform poorly for a small LIMIT picking from a much larger pool of qualifying rows, because it cannot use an index directly: The sophisticated expression behind rank has to be calculated for every qualifying row, then all have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query that can pick the best results from the index directly without even looking at the rest.
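
For reference, <-> is the pg_trgm distance operator, defined as one minus the similarity() value, which is why ordering by it ascending returns the greatest similarity first:

SELECT similarity('mel', 'melissa')    AS sim
     , 'mel'::text <-> 'melissa'::text AS dist;  -- dist = 1 - sim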

row_number() with empty window definition just reflects the ordering produced by the ORDER BY of the same SELECT.


As for your item 3., I added an answer to the question you referenced, which should explain it.

  • if I understood correctly, the most important change is `ORDER BY (username || ' ' || first_name || ' ' || last_name) <-> $1`. I had been hoping to later tweak my `rank` value to boost username matches, eg `((s_username * 2) + s_first_name + s_last_name) rank` but it seems like this wouldn't be possible in the optimised version – Anentropic Jul 03 '17 at 11:52
  • @Anentropic: The point of the optimization is that the query can pick the best matches from the index directly, which is much, much faster than retrieving all candidates, computing a rank for each, sorting all of them by this rank and then discarding all but the first ones - while `LIMIT` is substantially smaller than the total number of qualifying rows. – Erwin Brandstetter Jul 03 '17 at 13:15