Quick Search with autocorrect (GIN INDEX and PG_TRGM extension)

Question

I am testing a simple search mechanism to handle SMALL typos/misspellings. Similar to a autocorrect mechanism.

I am struggling a lot with this. So I am creating a function (pl/pgsql) to handle this, and I am running it on SUPABASE.IO, PostgreSQL 13.3 (similar to RDS).

I would like to:

LIMIT the returned results to only the highly similar email addresses, say similarity > 0.7;
Use an INDEX as the actual list of emails will be in the order of tens of millions, so it has to return in under a second.

DROP TABLE IF EXISTS email;
CREATE TABLE email (
  email_address TEXT NOT NULL UNIQUE,
  person_id UUID NOT NULL, 
  CONSTRAINT email_pk PRIMARY KEY (email_address)
);

DROP INDEX IF EXISTS email_address_trigram_idx;
CREATE INDEX email_address_trigram_idx ON email USING gin(email_address gin_trgm_ops);

INSERT INTO email(email_address, person_id) VALUES
  ('test100@gmail.com', uuid_generate_v4())
, ('100test@gmail.com', uuid_generate_v4())
, ('testoo1000@gmail.com', uuid_generate_v4())
, ('test1001@gmail.com', uuid_generate_v4())
, ('test100@gmial.com', uuid_generate_v4())
, ('test200@gmail.com', uuid_generate_v4())
, ('200test@gmail.com', uuid_generate_v4())
, ('testoo2000@gmail.com', uuid_generate_v4())
, ('test2002@gmail.com', uuid_generate_v4())
, ('test200@gmial.com', uuid_generate_v4())
, ('test300@gmail.com', uuid_generate_v4())
, ('300test@gmail.com', uuid_generate_v4())
, ('testoo3000@gmail.com', uuid_generate_v4())
, ('test3003@gmail.com', uuid_generate_v4())
, ('test300@gmial.com', uuid_generate_v4())
, ('test400@gmail.com', uuid_generate_v4())
, ('400test@gmail.com', uuid_generate_v4())
, ('testoo4000@gmail.com', uuid_generate_v4())
, ('test4004@gmail.com', uuid_generate_v4())
, ('test400@gmial.com', uuid_generate_v4())
, ('tset100@gmail.com', uuid_generate_v4())
, ('100tset@gmail.com', uuid_generate_v4())
, ('tsetoo1000@gmail.com', uuid_generate_v4())
, ('tset1001@gmail.com', uuid_generate_v4())
, ('tset100@gmial.com', uuid_generate_v4())
, ('tset200@gmail.com', uuid_generate_v4())
, ('200tset@gmail.com', uuid_generate_v4())
, ('tsetoo2000@gmail.com', uuid_generate_v4())
, ('tset2002@gmail.com', uuid_generate_v4())
, ('tset200@gmial.com', uuid_generate_v4())
, ('tset300@gmail.com', uuid_generate_v4())
, ('300tset@gmail.com', uuid_generate_v4())
, ('tsetoo3000@gmail.com', uuid_generate_v4())
, ('tset3003@gmail.com', uuid_generate_v4())
, ('tset300@gmial.com', uuid_generate_v4())
, ('tset400@gmail.com', uuid_generate_v4())
, ('400tset@gmail.com', uuid_generate_v4())
, ('tsetoo4000@gmail.com', uuid_generate_v4())
, ('tset4004@gmail.com', uuid_generate_v4())
, ('tset400@gmial.com', uuid_generate_v4())
, ('different_email@yahoo.com', uuid_generate_v4());

SET pg_trgm.similarity_threshold = 0.8; -- This doesn't seem to affect my queries

SELECT *, similarity('tesd100@gmail.com', email_address)
FROM email
WHERE email_address % 'tesd100@gmail.com';

I want a way of searching quickly and still tolerate some minor typos in the search.

I added SET_LIMIT(0.85) to the query and it seems to be working now. I need to test it more though. — jfbaro, Oct 30 '21 at 03:23
I now don't know how to use the % operator in a PL/PGSQL function... I tried <% and it doesn't work. Any tip? — jfbaro, Oct 30 '21 at 03:24
You can use the same SQL statement nested in a PL/pgSQL function. And you don't need the deprecated `SET_LIMIT()` in pg 13. Use `pg_trgm.similarity_threshold` as shown in the question. — Erwin Brandstetter, Oct 30 '21 at 05:23

Erwin Brandstetter · Accepted Answer · 2021-10-30T05:23:33.527

First off, your table definition creates two unique indices on (email_address). Don't. Drop the UNIQUE constraint, keep the PK:

CREATE TABLE email (
  email_address text PRIMARY KEY
, person_id uuid NOT NULL  -- bigint?
);

(Also not sure why you would need uuid for person_id. There aren't nearly enough people in the world to justify more than a bigint.)

Next, since you want to ...

LIMIT the returned results to only the highly similar email addresses,

I suggest a nearest neighbor search. Create a GiST index for this purpose instead of the GIN:

CREATE INDEX email_address_trigram_gist_idx ON email USING gist (email_address gist_trgm_ops);

And use a query like this:

SELECT *, similarity('tesd100@gmail.com', email_address)
FROM   email
WHERE  email_address % 'tesd100@gmail.com'
ORDER  BY email_address <-> 'tesd100@gmail.com'  -- note the use of the operator <->
LIMIT  10;

Quoting the manual:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.

While working with a small LIMIT, there is probably no need to set pg_trgm.similarity_threshold very high because this query gives you the best matches first.

Quick Search with autocorrect (GIN INDEX and PG_TRGM extension)

1 Answers1