Fast way to search based on non-literal comparison
I am developing a small search over rather large data sets, basically all strings. The relation between the table fields are simple enough, though the comparison mustn’t be literal. i.e. it should be able to correlate “filippo“, “philippo“, “filipo“ and so forth.
I have found a few ways it could be done, very frequently stumbling on Levinstein distance (this, here and here), though I am not sure it is practical on my specific case.
In a nutshell I have two tables, a small one with “search keys“ and a more massive one in which the search should be performed. Both tables have the same fields and they both have the same "meaning". E.g.
KEYS_TABLE
# | NAME | MIDNAME | SURNAME | ADDRESS | PHONE
1 | John | Fake | Doe | Sesame St. | 333-12-32
2 | Ralph | Stue | Michel | Bart. Ghost St. | 778-13000
...
and
SEARCH_TABLE
# | NAME | MIDNAME | SURNAME | ADDRESS | PHONE
...
532 | Jhon | F. | Doe | Sesame Street | 3331232
...
999 | Richard | Dalas | Doe | Sesame St. | 333-12-32
All I want to do is os obtain some sort of metric, or rank for each given record on KEYS_TABLE
, report all records from SEARCH_TABLE
above a certain relevance (defined either by the metric or simply some "KNN" like method).
I say that Levinstein distance might not be practical because it would require to calculate for every field in every row in KEYS_TABLE
x SEARCH_TABLE
. Considering that SEARCH_TABLE
has about 400 million records and KEYS_TABLE
varies from 100k to 1mil, the resulting number is way too large.
I was hoping there was some way I could previously enrich both tables, or some simpler (cheaper) way to perform the search.
Worth mentioning that I am allowed to transform the data at will. e.g. normalise St.
to st
, Street
to st
, remove special chars and so on.
What would be my options?