Search in 300 million addresses with pg_trgm

Question

I have 300 million addresses in my PostgreSQL 9.3 DB and I want to use pg_trgm to fuzzy search the rows. The final purpose is to implement a search function just like Google Map search.

When I used pg_trgm to search these addresses, it cost about 30s to get the results. There are many rows matching the default similarity threshold condition of 0.3 but I just need about 5 or 10 results. I created a trigram GiST index:

CREATE INDEX addresses_trgm_index ON addresses USING gist (address gist_trgm_ops);

This is my query:

SELECT address, similarity(address, '981 maun st') AS sml 
FROM addresses 
WHERE address % '981 maun st' 
ORDER BY sml DESC 
LIMIT 10;

The test table on production environment has been removed. I show the EXPLAIN output from my test environment. There are about 7 million rows and it needs about 1.6s to get results. With 300 million, it needs more than 30s.

ebdb=> explain analyse select address, similarity(address, '781 maun st') as sml from addresses where address % '781 maun st' order by sml desc limit 10;
                                    QUERY PLAN                                                                            
————————————————————————————————————————————————————————————————————————————————    
 Limit  (cost=7615.83..7615.86 rows=10 width=16) (actual time=1661.004..1661.010 rows=10 loops=1)
 ->  Sort  (cost=7615.83..7634.00 rows=7268 width=16) (actual time=1661.003..1661.005 rows=10 loops=1)
     Sort Key: (similarity((address)::text, '781 maun st'::text))
     Sort Method: top-N heapsort  Memory: 25kB
     ->  Index Scan using addresses_trgm_index on addresses  (cost=0.41..7458.78 rows=7268 width=16) (actual time=0.659..1656.386 rows=5241 loops=1)
           Index Cond: ((address)::text % '781 maun st'::text)
 Total runtime: 1661.066 ms
(7 rows)

Is there a good way to improve the performance or is it a good plan to do table partitioning?

"... I just need about 5 or 10 results" ... are you placing an appropriate LIMIT on the query? — David Aldridge, Jun 27 '17 at 07:52
Partitioning is available in Postgres 9.3 but is implemented using table inheritance. It is explicitly available in postgres 10. — VynlJunkie, Jun 27 '17 at 09:46
I take it "300MM+" means 300 million? If so, you should consider using ElasticSearch or something similar. — Adam Benson, Jun 27 '17 at 10:33
@DavidAldridge Thanks, I have added a limit. However I need to order by the similarity, so the calculate is very slow. — Gary Tao, Jun 28 '17 at 07:57
@GaryTao I would look at parallel query to see if there's a quick win there. https://www.postgresql.org/docs/9.6/static/parallel-query.html — David Aldridge, Jun 28 '17 at 11:16
@DavidAldridge Yeah, I will also consider to upgrade the database and use the parallel way. — Gary Tao, Jun 29 '17 at 02:54
Please **[EDIT]** your question and add the execution plan generated using **`explain (analyze, verbose)`**. [**Formatted text**](http://stackoverflow.com/help/formatting) please, [no screen shots](http://meta.stackoverflow.com/questions/285551/why-may-i-not-upload-images-of-code-on-so-when-asking-a-question/285557#285557) — , Jun 29 '17 at 05:46
@a_horse_with_no_name Have updated the question. Would appreciate hearing any suggestion. — Gary Tao, Jun 30 '17 at 03:21

Erwin Brandstetter · Answer 1 · 2021-10-30T05:10:45.707

PostgreSQL 9.3 ... Is there a good way to improve the performance or is it a good plan to do table partitioning?

Table partitioning will not help at all.

But yes, there is a good way: Upgrade to a current version of Postgres. There have been many improvements for GiST indexes, for the pg_trgm module in particular and for big data in general. Should be substantially faster with Postgres 10.

Your "nearest neighbor" search looks correct but for a small LIMIT use this equivalent query instead:

SELECT address, similarity(address, '981 maun st') AS sml 
FROM   addresses 
WHERE  address % '981 maun st' 
ORDER  BY address &lt-> '981 maun st'
LIMIT  10;

Quoting the manual:

It will usually beat the first formulation when only a small number of the closest matches is wanted.

Search in 300 million addresses with pg_trgm

1 Answers1

Linked