
I am trying to do a bulk similarity query in Postgres using the pg_trgm extension. I have set up a GIN index on the column, but it never seems to be used, and the docs don't suggest that a GiST index would help much either.

In my query I have a bunch of names, and I want to do a fuzzy match to see which names in the DB are similar:

Table:

CREATE TABLE people(
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name TEXT NOT NULL
);
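
The trigram operator classes and similarity() come from the pg_trgm extension, which is already installed here; for anyone reproducing the setup:

CREATE EXTENSION IF NOT EXISTS pg_trgm;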

Table size:

SELECT COUNT(*) FROM public.people;
-- 51436173

The Index:

CREATE INDEX trgm_idx ON public.people USING GIST (name public.gist_trgm_ops);
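
For reference, the GIN variant mentioned above would be created the same way with the GIN operator class (the index name here is just illustrative):

CREATE INDEX trgm_idx_gin ON public.people USING GIN (name public.gin_trgm_ops);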

My query takes a set of names to search for:

EXPLAIN (ANALYZE, VERBOSE)
SELECT id, people.name, keys.name, similarity(people.name, keys.name)
FROM people JOIN (VALUES ('alice'), ('bob'), ('charlie')) keys(name)
ON similarity(people.name, keys.name) >= 0.75;

The EXPLAIN:

+----------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------+
|Nested Loop  (cost=0.00..3742858.40 rows=51427344 width=67) (actual time=3415.321..383741.953 rows=73 loops=1)                    |
|  Output: people.id, people.name, "*VALUES*".column1, similarity(people.name, "*VALUES*".column1)                                 |
|  Join Filter: (similarity(people.name, "*VALUES*".column1) >= '0.75'::double precision)                                          |
|  Rows Removed by Join Filter: 154308446                                                                                          |
|  ->  Seq Scan on public.people  (cost=0.00..914354.44 rows=51427344 width=31) (actual time=1.921..9403.147 rows=51436173 loops=1)|
|        Output: people.id, people.name                                                                                            |
|  ->  Materialize  (cost=0.00..0.05 rows=3 width=32) (actual time=0.000..0.000 rows=3 loops=51436173)                             |
|        Output: "*VALUES*".column1                                                                                                |
|        ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=32) (actual time=0.075..0.081 rows=3 loops=1)                |
|              Output: "*VALUES*".column1                                                                                          |
|Planning Time: 3.580 ms                                                                                                           |
|JIT:                                                                                                                              |
|  Functions: 6                                                                                                                    |
|  Options: Inlining true, Optimization true, Expressions true, Deforming true                                                     |
|  Timing: Generation 12.076 ms, Inlining 9.587 ms, Optimization 24.373 ms, Emission 15.302 ms, Total 61.338 ms                    |
|Execution Time: 383756.056 ms                                                                                                     |
+----------------------------------------------------------------------------------------------------------------------------------+

The inline VALUES list contains all the names I want to fuzzy-search for. The problem is that the query is very slow and does a sequential scan over the whole people table. My question is: how do I get Postgres to use the GIN (or GiST) index for these lookups? I have already run VACUUM ANALYZE on the table, which contains about 51 million rows.

EDIT: Query and EXPLAIN based on Laurenz Albe's answer, with enable_seqscan disabled:


BEGIN;
SET LOCAL enable_seqscan = off;
SET LOCAL pg_trgm.similarity_threshold = 0.75;
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, people.name, keys.name, similarity(keys.name, people.name)
FROM people
    JOIN (VALUES ('alice'), ('bob'), ('charlie'), ('dani')) keys(name)
      ON people.name % keys.name;
COMMIT;


+--------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                            |
+--------------------------------------------------------------------------------------------------------------------------------------+
|Nested Loop  (cost=4758.47..3668792.68 rows=2058049 width=67) (actual time=2419.249..32730.893 rows=75 loops=1)                       |
|  Buffers: shared hit=13120 read=144894                                                                                               |
|  ->  Values Scan on "*VALUES*"  (cost=0.00..0.05 rows=4 width=32) (actual time=0.461..0.745 rows=4 loops=1)                          |
|  ->  Bitmap Heap Scan on people  (cost=4758.47..910766.76 rows=514512 width=31) (actual time=5615.752..8182.340 rows=19 loops=4)     |
|        Recheck Cond: (name % "*VALUES*".column1)                                                                                     |
|        Rows Removed by Index Recheck: 1122368                                                                                        |
|        Heap Blocks: exact=102969 lossy=33951                                                                                         |
|        Buffers: shared hit=13120 read=144894                                                                                         |
|        ->  Bitmap Index Scan on trgm_idx2  (cost=0.00..4629.84 rows=514512 width=0) (actual time=710.005..710.007 rows=37618 loops=4)|
|              Index Cond: (name % "*VALUES*".column1)                                                                                 |
|              Buffers: shared hit=12793 read=8286                                                                                     |
|Planning:                                                                                                                             |
|  Buffers: shared read=1                                                                                                              |
|Planning Time: 17.039 ms                                                                                                              |
|JIT:                                                                                                                                  |
|  Functions: 5                                                                                                                        |
|  Options: Inlining true, Optimization true, Expressions true, Deforming true                                                         |
|  Timing: Generation 38.873 ms, Inlining 75.855 ms, Optimization 163.283 ms, Emission 156.729 ms, Total 434.740 ms                    |
|Execution Time: 32787.630 ms                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------+
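
(Aside: the "lossy" heap blocks in this plan indicate the bitmap did not fit in work_mem, which forces extra recheck work. Raising it for the session is one thing to try; the value below is only an example.)

SET work_mem = '256MB';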

Carl Mastrangelo
  • You are right, `pg_trgm` does have trigram-based functions, but I don't think `similarity()` is one of them. I could be wrong though. I am 80% sure that you could use `similarity()` with `%` and it might work out a bit better. – PCDSandwichMan Aug 01 '23 at 21:22
  • They come from here: https://www.postgresql.org/docs/current/pgtrgm.html#PGTRGM-FUNC-TABLE – Carl Mastrangelo Aug 01 '23 at 21:29
  • Could you please share the results from explain(analyze, verbose, buffers, settings) for this SQL statement? (in plain text, as an update to your original question) And the DDL for the indexes? – Frank Heikens Aug 01 '23 at 21:31
  • @CarlMastrangelo sorry about the mix-up. What I mean is that, when it comes to indexing and query optimization, using the similarity() function directly in the JOIN clause doesn't utilize the GIN index. – PCDSandwichMan Aug 01 '23 at 21:37
  • This post might be more straightforward than trying to understand my clutter https://stackoverflow.com/questions/1566717/postgresql-like-query-performance-variations – PCDSandwichMan Aug 01 '23 at 21:39
  • This might help as well: https://dba.stackexchange.com/questions/103821/best-index-for-similarity-function – PCDSandwichMan Aug 01 '23 at 21:42
  • @Frank Heikens: I added the explain. – Carl Mastrangelo Aug 02 '23 at 17:09

1 Answer


You have to rewrite the query if you want it to use an index:

BEGIN;

SET LOCAL pg_trgm.similarity_threshold = 0.75;

SELECT id, people.name, keys.name, similarity(people.name, keys.name)
FROM people
   JOIN (VALUES ('alice'), ('bob'), ('charlie')) keys(name)
      ON people.name % keys.name; 

COMMIT;
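
(Why this helps: the `%` operator compares against pg_trgm.similarity_threshold and is supported by the gin_trgm_ops/gist_trgm_ops operator classes, while `similarity(...) >= 0.75` in the join condition cannot be matched to the index. The current threshold can be checked with:)

SHOW pg_trgm.similarity_threshold;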
Laurenz Albe
  • This worked out much better (2x as fast), but it doesn't appear to use the index. I added the explain to my question. – Carl Mastrangelo Aug 02 '23 at 18:05
  • Weird. Did you 'ANALYZE' the table? What happens if you `SET enable_seqscan = off` first? – Laurenz Albe Aug 02 '23 at 18:18
  • `VACUUM ANALYZE people;` didn't seem to affect it. Disabling seqscan results in: -> Bitmap Heap Scan on people (cost=34116.18..1031148.23 rows=514378 width=31) (actual time=137964.605..137972.049 rows=24 loops=3) | Buffers: shared hit=585 read=2234487 | -> Bitmap Index Scan on trgm_idx (cost=0.00..33987.59 rows=514378 width=0) (actual time=137900.906..137900.906 rows=24 loops=3) – Carl Mastrangelo Aug 02 '23 at 21:16
  • I cannot read that. Can you add the complete plan to the question? – Laurenz Albe Aug 03 '23 at 02:10
  • Updated the question with the plan for enable_seqscan off. – Carl Mastrangelo Aug 03 '23 at 18:12
  • It seems like your words contain very frequent trigrams (Rows Removed by Index Recheck: 1122368). Still, the time is much better. Does increasing `effective_cache_size` or lowering `random_page_cost` get you the better plan? – Laurenz Albe Aug 03 '23 at 19:11
  • I think it may be possible to shrink the index by adding a condition to it (a partial index) based on additional fields that were previously irrelevant to the question. I tried changing those settings, but my Postgres instance seemed to crash with them set. – Carl Mastrangelo Aug 07 '23 at 17:44
  • Lowering `random_page_cost` vastly improved the query plan and ended up solving my issue. – Carl Mastrangelo Aug 23 '23 at 01:17
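
(For context: random_page_cost can be lowered per session or persisted cluster-wide. The thread does not say which value was used, so the 1.1 below, a common choice for SSD storage, is purely illustrative.)

SET random_page_cost = 1.1;

-- or, to persist it for the whole cluster:
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();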