2

I have two large tables and I need to fit them together. Matching should not be a clear comparison. I used trigrams, Levenshtein's formula, but I get very weak performance. Maybe someone can help improve performance. The size of table A is about 200 thousand rows, the size of table B is about 600 thousand rows.

   CREATE TABLE TBL_A(NAME VARCHR,SURNAME VARCHAR, BIRTH_DATE DATE, TABLE_B_ID INT4);
   CREATE TABLE TBL_B(ID INT4, NAME VARCHR, SURNAME VARCHAR, BIRTH_DATE DATE);
--variant 1
SET pg_trgm.similarity_threshold = 0.8; 
UPDATE TBL_A A SET TABLE_B_ID = B.ID
FROM TBL_B B
WHERE A.NAME % B.NAME
AND A.SURNAME % B.SURNAME
AND ABS(A.BIRTH_DATE ::DATE - B.BIRTH_DATE ::DATE)<=1 
--variant 2
UPDATE TBL_A A SET TABLE_B_ID = B.ID
FROM TBL_B B
WHERE A.NAME = B.NAME
AND A.SURNAME = B.SURNAME
AND ABS(A.BIRTH_DATE ::DATE - B.BIRTH_DATE ::DATE)<=1   
--variant 3
UPDATE TBL_A A SET TABLE_B_ID = B.ID
FROM TBL_B B
WHERE levenshtein_less_equal (A.NAME ,B.NAME,2)<=2
AND levenshtein_less_equal (A.SURNAME ,B.SURNAME,2)<=2 
AND ABS(A.BIRTH_DATE ::DATE - B.BIRTH_DATE ::DATE)<=1 

All of these options had very bad performance ( near about 7 hour). I tried creating indexes but didn't get much speed up

CREATE INDEX ind_a_name ON TBL_A USING gist(NAME  trm_gist_ops);
CREATE INDEX ind_a_Surname ON TBL_A USING gist(SURNAME  trm_gist_ops);
O. Jones
  • 103,626
  • 17
  • 118
  • 172

1 Answers1

0

Levenshtein distance comparisons cannot be indexed, unfortunately. Each comparison is a function of both input strings.

One usually approaches this sort of problem by using a two stage where clause that eliminates most comparisons, then applying Levenshtein's string-distance function.

Can you design an injective function f(name) which yields some sort of signature of the name? It could remove the vowels from the name, for a trivial example. SOUNDEX() is such a function, but it's really crude and only works properly on North American names. Metaphone is a similar function. (The guys who dreamed up these functions were all English-speakers.)

If you do that, then you can populate your table with

   name, signature_name

put a index on (signature_name, name), and use this WHERE filter.

 WHERE A.signature_name = B.signature_name
   AND levenshtein_less_equal (A.name,B.name,2)<=2

The trick: do most of your comparison work with indexed columns, and only use Levenshtein when you already know you have a close match.

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • thx for your reply. Do you have any exp with ts_vectors? Maybe it could be better? – Ирина Ромашкина Aug 09 '21 at 13:49
  • put a index on (signature_name, name), and use this WHERE filter -- btree ? – Ирина Ромашкина Aug 09 '21 at 13:50
  • The kind of function I propose will work well with BTREE. But results from that function are only comparable for equality, not for value. (That is, it makes no sense to say `sig1 >= sig2 - 2` or some such thing.) So another index organization (hash) may work also. – O. Jones Aug 09 '21 at 13:54
  • To inquire about `ts_vector` use, it might be best to ask another question. Please include some sample data and desired results. For what it's worth the `ts_*` functions work best with documents (columns) containing multiple words, not single words. – O. Jones Aug 10 '21 at 00:04