I am trying to run a query that joins a table against itself and does fuzzy string comparison (using trigram comparisons) to find possible company name matches. My goal is to return records where one record's company name (ref_name field) is trigram-similar to another record's company name. Currently, I have my threshold set to 0.9, so it will only bring back matches that are very likely to contain a similar string.
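A minimal sketch of how such a threshold is applied per session, assuming the pg_trgm extension's set_limit() and show_limit() functions are what controls the % operator here (the literal names are taken from the sample records in this post):

```sql
-- Assumption: the pg_trgm extension is installed in this database.
-- set_limit() sets the similarity threshold used by the % operator
-- for the current session; show_limit() reports the current value.
SELECT set_limit(0.9);
SELECT show_limit();

-- The % operator returns true when similarity(a, b) meets the threshold.
-- Explicit ::text casts avoid ambiguity with the numeric modulo operator.
SELECT 'A 1 ACOUSTICS'::text % 'A1 ACOUSTICS'::text AS is_match;
```

Because the setting is per session, it has to be applied in the same session that runs the self-join.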
I know that self joins can result in many comparisons by nature, but I want to optimize my query as much as I can. I don't need results instantaneously, but the query currently takes 11 hours to run.
I am running Postgres 9.2 on an Ubuntu 12.04 server. I don't know the maximum length of the ref_name field (the field I'm matching on), so I declared it as varchar(300). I wonder whether changing it to a text type would affect performance at all, or whether there is a better field type to use for speed. My LC_CTYPE and LC_COLLATE locales are set to "en_US.UTF-8".
The table I am running the query on consists of about 1.6 million records in total, but the query that takes me 11 hours to run is on a small subset of that (about 100k).
Table Structure:
CREATE TABLE ref_name (
    ref_name_id integer,
    ref_name character varying(300),
    ref_name_type character varying(2),
    name_display text,
    load_date timestamp without time zone
);
Indexes:
CREATE INDEX ref_name_ref_name_trigram_idx ON ref_name
    USING gist (ref_name COLLATE pg_catalog."default" gist_trgm_ops);
CREATE INDEX ref_name_ref_name_trigram_idx_1 ON ref_name
    USING gist (ref_name COLLATE pg_catalog."default" gist_trgm_ops)
    WHERE ref_name_type::text = 'E'::text;
CREATE INDEX ref_name_ref_name_e_idx ON ref_name
    USING btree (ref_name COLLATE pg_catalog."default")
    WHERE ref_name_type::text = 'E'::text;
Query:
SELECT a.ref_name_id AS name_id,
       a.ref_name AS name,
       a.name_display AS name_display,
       b.ref_name_id AS matched_name_id,
       b.ref_name AS matched_name,
       b.name_display AS matched_name_display
FROM ref_name a
JOIN ref_name b
  ON a.ref_name_id <> b.ref_name_id
 AND a.ref_name_id > b.ref_name_id
 AND a.ref_name % b.ref_name
WHERE a.ref_name ~>=~ 'A' AND a.ref_name ~<~ 'B'
  AND b.ref_name ~>=~ 'A' AND b.ref_name ~<~ 'B'
  AND a.ref_name_type = 'E'
  AND b.ref_name_type = 'E';
Explain Plan:
"Nested Loop (cost=0.00..8560728.16 rows=3598470 width=96)"
" -> Seq Scan on ref_name a (cost=0.00..96556.12 rows=103901 width=48)"
" Filter: (((ref_name)::text ~>=~ 'A'::text) AND ((ref_name)::text ~<~ 'B'::text) AND ((ref_name_type)::text = 'E'::text))"
" -> Index Scan using ref_name_ref_name_trigram_idx_1 on ref_name b (cost=0.00..80.41 rows=35 width=48)"
" Index Cond: ((a.ref_name)::text % (ref_name)::text)"
" Filter: (((ref_name)::text ~>=~ 'A'::text) AND ((ref_name)::text ~<~ 'B'::text) AND (a.ref_name_id <> ref_name_id) AND (a.ref_name_id > ref_name_id))"
Here are some sample records:
1652632;"A 123 SYSTEMS";"E";"A 123 SYSTEMS INC";"2014-11-14 00:00:00"
1652633;"A123 SYSTEMS";"E";"A123 SYSTEMS INC";"2014-11-14 00:00:00"
1652640;"A 1 ACCOUSTICS";"E";"A-1 ACCOUSTICS";"2014-11-14 00:00:00"
1652641;"A 1 ACOUSTICS";"E";"A-1 ACOUSTICS";"2014-11-14 00:00:00"
1652642;"A1 ACOUSTICS";"E";"A1 ACOUSTICS INC";"2014-11-14 00:00:00"
1652650;"A 1 A ELECTRICAL";"E";"A-1 A ELECTRICAL INC";"2014-11-14 00:00:00"
1652651;"A 1 A ELECTRICIAN";"E";"A 1 A ELECTRICIAN INC";"2014-11-14 00:00:00"
1652652;"A 1A ELECTRICIAN";"E";"A 1A ELECTRICIAN INC";"2014-11-14 00:00:00"
1652653;"A1 A ELECTRICIAN";"E";"A1 A ELECTRICIAN INC";"2014-11-14 00:00:00"
1691270;"ALBERT GARLATTI";"E";"ALBERT GARLATTI";"2014-11-14 00:00:00"
1691271;"ALBERT GARLATTI CONSTRUCTION";"E";"ALBERT GARLATTI CONSTRUCTION CO";"2014-11-14 00:00:00"
1680892;"AG HOG PITTSBURGH";"E";"AG-HOG PITTSBURGH CO INC";"2014-11-14 00:00:00"
1680893;"AGHOG PITTSBURGH";"E";"AGHOG PITTSBURGH CO";"2014-11-14 00:00:00"
1680928;"AGILE PURSUITS FRACHISING";"E";"AGILE PURSUITS FRACHISING INC";"2014-11-14 00:00:00"
1680929;"AGILE PURSUITS FRANCHISING";"E";"AGILE PURSUITS FRANCHISING INC";"2014-11-14 00:00:00"
1680956;"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORT";"E";"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORT";"2014-11-14 00:00:00"
1680957;"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORTI";"E";"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORTI";"2014-11-14 00:00:00"
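To spot-check how individual pairs from these samples score against the threshold, something like the following could be run. This is only a sketch, assuming pg_trgm's similarity() and show_trgm() functions are available; the column aliases are illustrative, and no scores are claimed here.

```sql
-- Assumption: the pg_trgm extension is installed in this database.
-- similarity() returns a value between 0 and 1 for a pair of strings.
SELECT similarity('A 1 ACCOUSTICS', 'A 1 ACOUSTICS') AS sim_acoustics,
       similarity('AGILE PURSUITS FRACHISING', 'AGILE PURSUITS FRANCHISING') AS sim_agile,
       similarity('A 123 SYSTEMS', 'A123 SYSTEMS') AS sim_systems;

-- show_trgm() lists the trigrams extracted from a string, which helps
-- explain why names that differ only in spacing may score lower than expected.
SELECT show_trgm('A 123 SYSTEMS') AS spaced,
       show_trgm('A123 SYSTEMS') AS unspaced;
```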
As you can see, I created a gist trigram index to speed things up (tried two different types so far for comparison). Does anyone have any suggestions on how I can improve the performance of this query and get it down from 11 hours to something more manageable? Eventually I would like to run this query on the whole table to compare records, not just this small subset.