
Fast way to search based on non-literal comparison

I am developing a small search over rather large data sets, basically all strings. The relations between the table fields are simple enough, but the comparison must not be literal: it should be able to correlate "filippo", "philippo", "filipo" and so forth.

I have found a few ways it could be done, very frequently stumbling on the Levenshtein distance (this, here and here), though I am not sure it is practical in my specific case.

In a nutshell, I have two tables: a small one with "search keys" and a much larger one in which the search should be performed. Both tables have the same fields, and the fields have the same meaning. E.g.

KEYS_TABLE
# | NAME  | MIDNAME | SURNAME | ADDRESS         | PHONE
1 | John  | Fake    | Doe     | Sesame St.      | 333-12-32
2 | Ralph | Stue    | Michel  | Bart. Ghost St. | 778-13000
...

and

SEARCH_TABLE
#   | NAME     | MIDNAME | SURNAME | ADDRESS         | PHONE
...
532 | Jhon     | F.      | Doe     | Sesame Street   | 3331232
...
999 | Richard  | Dalas   | Doe     | Sesame St.      | 333-12-32

All I want to do is obtain some sort of metric, or rank, for each given record in KEYS_TABLE, and report all records from SEARCH_TABLE above a certain relevance (defined either by a threshold on the metric or simply by some KNN-like method).

I say that the Levenshtein distance might not be practical because it would require computing it for every field of every row in the KEYS_TABLE × SEARCH_TABLE cross product. Considering that SEARCH_TABLE has about 400 million records and KEYS_TABLE varies from 100k to 1 million, the resulting number of comparisons is far too large.

I was hoping there was some way I could pre-process (enrich) both tables beforehand, or some simpler (cheaper) way to perform the search.

Worth mentioning that I am allowed to transform the data at will, e.g. normalise "St." to "st", "Street" to "st", remove special characters, and so on.
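A minimal sketch of that kind of normalization (the abbreviation map and the `Normalizer` class name are just illustrative; extend the map for your own data):

```java
import java.util.Map;

public class Normalizer {
    // Illustrative long-form -> abbreviation map; extend as needed.
    private static final Map<String, String> ABBREVIATIONS = Map.of(
            "street", "st",
            "avenue", "ave",
            "boulevard", "blvd");

    public static String normalize(String value) {
        // Lower-case, strip special characters, collapse whitespace.
        String cleaned = value.toLowerCase()
                .replaceAll("[^a-z0-9 ]", " ")
                .replaceAll("\\s+", " ")
                .trim();
        // Replace known long forms with their abbreviation, token by token.
        StringBuilder out = new StringBuilder();
        for (String token : cleaned.split(" ")) {
            if (out.length() > 0) out.append(' ');
            out.append(ABBREVIATIONS.getOrDefault(token, token));
        }
        return out.toString();
    }
}
```

With this, both "Sesame Street" and "Sesame St." collapse to the same string, so an exact comparison on the normalized form already catches a lot of near-matches.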

What would be my options?

filippo

2 Answers


One heuristic approach I can think of:

In addition to the original fields in the table, for each field also store its normalized form obtained by some stemming algorithm. If you are using Java, Lucene's EnglishAnalyzer might help you with this step.

Do an exact comparison using standard methods to find, for each entry in table1, a list of candidates. An entry e2 in table2 is a candidate for entry e1 in table1 if they have some common field whose normalized forms match. That can be done efficiently using some data structure that allows quick string lookups - there are plenty of those.
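The candidate-generation step can be sketched with a plain hash index over the normalized field values (the `CandidateIndex` class and the row ids are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CandidateIndex {
    // Maps a normalized field value to the ids of table2 rows containing it.
    private final Map<String, List<Integer>> index = new HashMap<>();

    public void add(int rowId, String normalizedValue) {
        index.computeIfAbsent(normalizedValue, k -> new ArrayList<>()).add(rowId);
    }

    // Exact lookup on the normalized form; returns candidate row ids,
    // or an empty list when nothing matches.
    public List<Integer> candidates(String normalizedKey) {
        return index.getOrDefault(normalizedKey, List.of());
    }
}
```

Building this index is a single pass over table2, and each lookup is O(1) on average, so the expensive fuzzy metric only ever runs on the (usually small) candidate lists.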

For each entry e1, find the "best" candidate(s) in its list using the exact metric you chose (for example, your suggested Levenshtein distance).

You might want to do some post-processing to make sure you don't have two elements in table1 mapped to the same element in table2, if that's an issue.

amit

Depending on what misspellings are likely, you might be able to use Soundex or Metaphone for your searches.
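A minimal sketch of American Soundex, which maps similar-sounding names to the same four-character code (this simplified version omits the full H/W separator rule of the official coding scheme):

```java
public class Soundex {
    // Digit for each letter a..z; '0' marks vowels and other ignored letters.
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String name) {
        String s = name.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        StringBuilder out = new StringBuilder();
        out.append(s.charAt(0));                       // keep the first letter
        char prev = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char code = CODES.charAt(s.charAt(i) - 'A');
            // Skip ignored letters and adjacent repeats of the same code.
            if (code != '0' && code != prev) out.append(code);
            prev = code;
        }
        while (out.length() < 4) out.append('0');      // pad to 4 characters
        return out.toString();
    }
}
```

Because the code can be precomputed and stored per field, matching becomes an exact equality test on the codes - e.g. "John" and "Jhon" both encode to J500, so the 400M × 1M cross product is avoided entirely.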

Mark Leighton Fisher