I am trying to do a bulk similarity query in Postgres using the pg_trgm extension. I have set up a GIN index on the column, but it seems like it is never used. Additionally, the docs don't seem to imply that a GiST index would help much either.
In my query I have a bunch of names, and I want to do a fuzzy match to see which names in the DB are similar:
Table:
CREATE TABLE people(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT NOT NULL
);
Table size:
SELECT COUNT(*) FROM public.people;
-- 51436173
The Index:
CREATE INDEX trgm_idx ON public.people USING GIST (name public.gist_trgm_ops);
My query takes a set of names to search for:
EXPLAIN (ANALYZE, VERBOSE)
SELECT id, people.name, keys.name, similarity(people.name, keys.name)
FROM people JOIN (VALUES ('alice'), ('bob'), ('charlie')) keys(name)
ON similarity(people.name, keys.name) >= 0.75;
The EXPLAIN:
+----------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+----------------------------------------------------------------------------------------------------------------------------------+
|Nested Loop (cost=0.00..3742858.40 rows=51427344 width=67) (actual time=3415.321..383741.953 rows=73 loops=1) |
| Output: people.id, people.name, "*VALUES*".column1, similarity(people.name, "*VALUES*".column1) |
| Join Filter: (similarity(people.name, "*VALUES*".column1) >= '0.75'::double precision) |
| Rows Removed by Join Filter: 154308446 |
| -> Seq Scan on public.people (cost=0.00..914354.44 rows=51427344 width=31) (actual time=1.921..9403.147 rows=51436173 loops=1)|
| Output: people.id, people.name |
| -> Materialize (cost=0.00..0.05 rows=3 width=32) (actual time=0.000..0.000 rows=3 loops=51436173) |
| Output: "*VALUES*".column1 |
| -> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=32) (actual time=0.075..0.081 rows=3 loops=1) |
| Output: "*VALUES*".column1 |
|Planning Time: 3.580 ms |
|JIT: |
| Functions: 6 |
| Options: Inlining true, Optimization true, Expressions true, Deforming true |
| Timing: Generation 12.076 ms, Inlining 9.587 ms, Optimization 24.373 ms, Emission 15.302 ms, Total 61.338 ms |
|Execution Time: 383756.056 ms |
+----------------------------------------------------------------------------------------------------------------------------------+
The inline VALUES table will contain all the names I want to do a fuzzy search for. The problem I encounter is that the query is very slow, and does a parallel sequential scan over the whole people
table. My question is how to get Postgres to use the GIN (or GiST index) for most of the queries? I have VACUUM ANALYZE the table already, and the table contains a couple million rows.
EDIT: Query and Explain from Laurenz Albe's Answer:
BEGIN;
SET LOCAL enable_seqscan = off;
SET LOCAL pg_trgm.similarity_threshold = 0.75;
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, people.name, keys.name, similarity(keys.name, people.name)
FROM people
JOIN (VALUES ('alice'), ('bob'), ('charlie'), ('dani')) keys(name)
ON people.name % keys.name;
COMMIT;
+--------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+--------------------------------------------------------------------------------------------------------------------------------------+
|Nested Loop (cost=4758.47..3668792.68 rows=2058049 width=67) (actual time=2419.249..32730.893 rows=75 loops=1) |
| Buffers: shared hit=13120 read=144894 |
| -> Values Scan on "*VALUES*" (cost=0.00..0.05 rows=4 width=32) (actual time=0.461..0.745 rows=4 loops=1) |
| -> Bitmap Heap Scan on people (cost=4758.47..910766.76 rows=514512 width=31) (actual time=5615.752..8182.340 rows=19 loops=4) |
| Recheck Cond: (name % "*VALUES*".column1) |
| Rows Removed by Index Recheck: 1122368 |
| Heap Blocks: exact=102969 lossy=33951 |
| Buffers: shared hit=13120 read=144894 |
| -> Bitmap Index Scan on trgm_idx2 (cost=0.00..4629.84 rows=514512 width=0) (actual time=710.005..710.007 rows=37618 loops=4)|
| Index Cond: (name % "*VALUES*".column1) |
| Buffers: shared hit=12793 read=8286 |
|Planning: |
| Buffers: shared read=1 |
|Planning Time: 17.039 ms |
|JIT: |
| Functions: 5 |
| Options: Inlining true, Optimization true, Expressions true, Deforming true |
| Timing: Generation 38.873 ms, Inlining 75.855 ms, Optimization 163.283 ms, Emission 156.729 ms, Total 434.740 ms |
|Execution Time: 32787.630 ms |
+--------------------------------------------------------------------------------------------------------------------------------------+