I want to do some basic geocoding of addresses using Postgres. I have an address table that has around 1 million raw address strings:
=> \d addresses
Table "public.addresses"
Column | Type | Modifiers
---------+------+-----------
address | text |
I also have a table of location data:
=> \d locations
Table "public.locations"
Column | Type | Modifiers
------------+------+-----------
id | text |
country | text |
postalcode | text |
latitude | text |
longitude | text |
Most of the address strings contain postalcodes, so my first attempt was to do a like and a lateral join:
EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
SELECT * FROM locations
WHERE address ilike '%' || postalcode || '%'
ORDER BY LENGTH(postalcode) DESC
LIMIT 1
) AS l ON true;
That gave the expected result, but it was slow. Here's the query plan:
QUERY PLAN
--------------------------------------------------------------------------------------
Nested Loop (cost=18383.07..18540688323.77 rows=1008572 width=91)
-> Seq Scan on addresses a (cost=0.00..20997.72 rows=1008572 width=56)
-> Limit (cost=18383.07..18383.07 rows=1 width=35)
-> Sort (cost=18383.07..18391.93 rows=3547 width=35)
Sort Key: (length(locations.postalcode))
-> Seq Scan on locations (cost=0.00..18365.33 rows=3547 width=35)
Filter: (a.address ~~* (('%'::text || postalcode) || '%'::text))
I tried adding a gist trigram index to the address column, like mentioned at https://stackoverflow.com/a/13452528/36191, but the query plan for the above query doesn't make use of it, and the query plan in unchanged.
CREATE INDEX idx_address ON addresses USING gin (address gin_trgm_ops);
I have to remove the order by and limit in the lateral join query for the index to get used, which doesn't give me the results I want. Here's the query plan for the query without ORDER
or LIMIT
:
QUERY PLAN
-----------------------------------------------------------------------------------------------
Nested Loop (cost=39.35..129156073.06 rows=3577682241 width=86)
-> Seq Scan on locations (cost=0.00..12498.55 rows=709455 width=28)
-> Bitmap Heap Scan on addresses a (cost=39.35..131.60 rows=5043 width=58)
Recheck Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
-> Bitmap Index Scan on idx_address (cost=0.00..38.09 rows=5043 width=0)
Index Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
Is there something I can do to get the query to use the index, or is there a better way to rewrite this query?