4

I'm currently working on a search feature that ends up hitting the db with a LIKE query. It used to be of the form WHERE some_id = blah AND some_timestamp > blah AND (field1 LIKE '%some_text%' OR field2 LIKE '%some_text%' OR ...) ORDER BY some_timestamp DESC. This hasn't scaled well now that the table is in the size of tens of millions of rows, especially when it is filtered on a very old timestamp.

After some research it looked like a trigram index might be more performant for the text searching. So I added a trigram index on all of the text fields concatenated and got good results initially. After varying the new query though I found a regression. An old index (btree on some_id and some_timestamp DESC) was not being hit anymore. So the new text searching helps with certain text queries that used to be very slow, and other text queries that used to be very fast (a few ms) due to the btree index are now super slow (see below).

Is there a way to get the best of both worlds? Fast trigram text searching and fast btree indexing for the queries that need it?

Notes:

  • Postgres 11.6

  • I tried a btree_gin index to index the timestamp column as well but got pretty much the same performance.

  • I slightly modified my query (concatenated whitespace) to bypass the trigram index and verified the slow queries return to the btree index and <10ms execution times.

  • I tried some query rearrangement to try to get both indexes hit to no avail.

Table:

table1
---------------------------------
some_id        | bigint
field1         | text
field2         | text
field3         | text
field4         | text
field5         | text
field6         | bigint
some_timestamp | timestamp without time zone

Trigram Index:

CREATE INDEX CONCURRENTLY IF NOT EXISTS trgm_idx ON table1 USING gin ((COALESCE(field1, '') || ' ' || COALESCE(field2, '') || COALESCE(field3, '') || ' ' || COALESCE(field4, '') || ' ' || COALESCE(field5, '') || ' ' || field6::text) gin_trgm_ops);

Query:

SELECT *
FROM table1 i
WHERE i.some_id = 1
    AND (COALESCE(field1, '') || ' ' || COALESCE(field2, '') || COALESCE(field3, '') || ' ' || COALESCE(field4, '') || ' ' || COALESCE(field5, '') || ' ' || field6::text) ILIKE '%some_text%'
    AND i.some_timestamp > '2015-01-00 00:00:00.0'
ORDER BY some_timestamp DESC limit 20;

Explain:

 Limit  (cost=1043.06..1043.11 rows=20 width=446) (actual time=37240.094..37240.099 rows=20 loops=1)
   ->  Sort  (cost=1043.06..1043.15 rows=39 width=446) (actual time=37240.092..37240.095 rows=20 loops=1)
         Sort Key: some_timestamp
         Sort Method: top-N heapsort  Memory: 36kB
         ->  Bitmap Heap Scan on table1 i  (cost=345.01..1042.03 rows=39 width=446) (actual time=1413.415..37202.331 rows=83066 loops=1)
               Recheck Cond: ((((((((((COALESCE(field1, ''::text) || ' '::text) || COALESCE(field2, ''::text)) || COALESCE(field3, ''::text)) || ' '::text) || COALESCE(field4, ''::text)) || ' '::text) || COALESCE(field5, ''::text)) || ' '::text) || (field6)::text) ~~* '%some_text%'::text)
               Rows Removed by Index Recheck: 23
               Filter: ((some_timestamp > '2015-01-00 00:00:00'::timestamp without time zone) AND (some_id = 1))
               Rows Removed by Filter: 5746666
               Heap Blocks: exact=395922
               ->  Bitmap Index Scan on trgm_idx  (cost=0.00..345.00 rows=667 width=0) (actual time=1325.867..1325.867 rows=5833670 loops=1)
                     Index Cond: ((((((((((COALESCE(field1, ''::text) || ' '::text) || COALESCE(field2, ''::text)) || COALESCE(field3, ''::text)) || ' '::text) || COALESCE(field4, ''::text)) || ' '::text) || COALESCE(field5, ''::text)) || ' '::text) || (field6)::text) ~~* '%some_text%'::text)
 Planning Time: 0.252 ms
 Execution Time: 37243.205 ms
(14 rows)
user3112658
  • 121
  • 2
  • 8

2 Answers2

0

I slightly modified my query (concatenated whitespace) to bypass the trigram index and verified the slow queries return to the btree index and <10ms execution times.

This is ugly, but it is almost certainly the solution. Maybe we can fix things so that some (far) future version of PostgreSQL will not need such fixes, but that won't help you today.

Bitmap Index Scan on trgm_idx  (cost=0.00..345.00 rows=667 width=0) (actual time=1325.867..1325.867 rows=5833670 loops=1)

This estimate is clearly deranged, and that is probably the source of the problem. But trigram indexes and ILIKE queries are very sensitive to the actual query text. Having only an (apparently) anonymized value '%some_text%' is not enough to take a deeper look.

And another approach would be use to use GiST rather than GIN. GiST indexes can beneficially use multiple columns simultaneously.

CREATE INDEX CONCURRENTLY IF NOT EXISTS trgm_idx ON table1 USING gist
   (some_id, some_timestamp, (COALESCE(field1, '') || ' ' || COALESCE(field2, '') || COALESCE(field3, '') || ' ' || COALESCE(field4, '') || ' ' || COALESCE(field5, '') || ' ' || field6::text) gist_trgm_ops);

You might want to include only one or the other of the first two columns, depending on how much selectivity each one provides. You will probably need to do some experiments.

I don't like using GiST with pg_trgm, I find the performance (both in use and in building) erratic. But for a useful multi-column index in this fashion, you have little other choice.

Anyway, you have an index that works well already, it just isn't using it. Making the GiST index might be "good enough" to lure the query away from GIN, but then it might just make some other query pick the wrong plan.

Another approach would be to use RUM indexes, this allows you to sort with data stored in the index, but I think you would have to write some code to make them support pg_trgm.

jjanes
  • 37,812
  • 5
  • 27
  • 34
-1

Create an additional index:

CREATE INDEX ON table1 (some_id, some_timestamp);

Then you have a good chance of getting a bitmap or of a scan on both indexes, which should be much faster than having over 5 million rows removed by a filter.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
  • That's what I meant by "An old index (btree on some_id and some_timestamp DESC) was not being hit anymore". I already have an index on those columns, but it's not getting hit anymore with the new index in place. – user3112658 Feb 19 '20 at 16:58
  • Then the problem is that PostgreSQL mis-estimates the number of result rows and decides not to use the other index. I don't know if PostgreSQL collects meaningful statistics for trigrams. Does `ANALYZE` with a higher setting of `default_statistics_target` help? If not, then I am out of options. – Laurenz Albe Feb 19 '20 at 17:04
  • Well running `ANALYZE` after bumping `default_statistics_target` did do something. Now the planner is falling back to the old btree index and not using the trigrams at all. Very interesting. I wonder if the planner is just not smart enough to use both of these indexes at the same time. – user3112658 Feb 19 '20 at 19:34
  • @user3112658 Presumably the old btree index is being used to fetch the data in order of the ORDER BY, then stops early once the LIMIT is satisfied. This type of index scan (to supply order) cannot be used with a 2nd index on the same table to supply selectivity. It isn't just the planner. The executor doesn't currently know how to implement it, so there is no point in planning it. – jjanes Feb 19 '20 at 22:11