Effective algorithm for matching complex name variations using Java

Question

I have to match a number of store names and I am having a difficult time getting Levenshtein and SoundEx to yield acceptable results given my data.

Here are a few examples of what I'm dealing with:

The Home Depot 
Office Depot
Apple store 
Apple 
Walgreens 
Walgreens Denver 
Quiznos 
Quiznos Sandwich Restaurants

So given "Quiznos Sandwich Restaurants", for example, I'd want to match it to "Quiznos"... "Walgreens Denver" to "Walgreens". I have a whole list of these store names.

Any help would be great.

I'm not sure I understand your question. Care to elaborate a bit? — jrad, Jul 18 '12 at 04:49
On the contrary... I'm *quite sure* I don't understand this question — Bohemian, Jul 18 '12 at 04:51
Levenshtein represents edit distance, and is not a measure of semantic closeness ... `Quiznos` and `Quiznos Sandwich Restaurants` have a large edit distance. — Greg Kopff, Jul 18 '12 at 04:55
What kind of format is the `Quiznos Sandwich Restaurants` and `Walgreens Denver` in? Are they all just in one large string, or in some kind of collection, seperated into individual entries? — Jason Larke, Jul 18 '12 at 04:56
sounds to me like you're looking for something kind of like [*Google suggest API*](http://blogoscoped.com/archive/2006-08-17-n22.html) — Nir Alfasi, Jul 18 '12 at 04:57

score 1 · Accepted Answer · edited May 23 '17 at 10:30

maybe try and narrow down the search field by "canonicalizing" a bit? remove fluff like "the" and "store" from the query, run it through a dictionary to fix obvious mistakes and typos?, identifying and removing obvious location references (like "denver" above) could also help.

edit: to expand a bit (and name-drop a few other CS topics ;-) ) - if youre really looking at solving the "best" (most complex) way, you'd need to take your input string, run it through some parts-of-speech tagger (see helpful question here Java Stanford NLP: Part of Speech labels?) and then use the tagging data to remove connecting words (for example - "mcdonalnds around manhatten" - around could be identified and removed). maybe it'll even help identify plural forms (dont know, never tried) so things like "home depots in washington" could be canonicalized to "home depot"

score 0 · Answer 2 · answered Jul 18 '12 at 04:59

For this problem you know Levenshtein Complexity is O(mn) which is very high for huge data.

By examining diagonals instead of rows, and by using lazy evaluation, we can find the Levenshtein distance in O(m (1 + d)) time (where d is the Levenshtein distance), which is much faster than the regular dynamic programming algorithm if the distance is small.

Link to Lazy Evaluation : http://en.wikipedia.org/wiki/Lazy_evaluation

OR we can also initializing the first row of the matrix with 0, the algorithm can be used for fuzzy string search of a string in a text. This modification gives the end-position of matching substrings of the text. To determine the start-position of the matching substrings, the number of insertions and deletions can be stored separately and used to compute the start-position from the end-position.

Effective algorithm for matching complex name variations using Java

2 Answers2