
I need to compare a set of strings to another set of strings and find which strings are similar (fuzzy-string matching). For example:

{ "A.B. Mann Incorporated", "Mr. Enrique Bellini", "Park Management Systems" } 
and
{ "Park", "AB Mann Inc.", "E. Bellini" }

Assuming a zero-based index, the matches would be 0-1, 1-2, 2-0. Obviously, no algorithm can be perfect at this type of thing.

I have a working implementation of the Levenshtein-distance algorithm, but using it to find similar strings from each set necessitates looping through both sets of strings to do the comparison, resulting in an O(n^2) algorithm. This runs unacceptably slow even with modestly sized sets.
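To illustrate, here is a minimal sketch of what the brute-force comparison looks like (the Levenshtein implementation here is a standard textbook version, not my actual code; `BruteForceSketch` is just an illustrative name):

```csharp
using System;

class BruteForceSketch
{
    // Standard dynamic-programming Levenshtein distance.
    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        var setA = new[] { "A.B. Mann Incorporated", "Mr. Enrique Bellini", "Park Management Systems" };
        var setB = new[] { "Park", "AB Mann Inc.", "E. Bellini" };

        // The nested loop over both sets is what makes this O(n^2).
        for (int i = 0; i < setA.Length; i++)
        {
            int best = -1, bestDist = int.MaxValue;
            for (int j = 0; j < setB.Length; j++)
            {
                int dist = Levenshtein(setA[i], setB[j]);
                if (dist < bestDist) { bestDist = dist; best = j; }
            }
            Console.WriteLine($"{i} -> {best} (distance {bestDist})");
        }
    }
}
```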

I've also tried a clustering algorithm that uses shingling and the Jaccard coefficient. Unfortunately, this too runs in O(n^2), which ends up being too slow, even with bit-level optimizations.
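The shingling approach was roughly along these lines (a simplified sketch using character trigrams, without the bit-level optimizations):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ShingleSketch
{
    // Character k-grams ("shingles") of a string.
    static HashSet<string> Shingles(string s, int k = 3)
    {
        var shingles = new HashSet<string>();
        for (int i = 0; i + k <= s.Length; i++)
            shingles.Add(s.Substring(i, k));
        return shingles;
    }

    // Jaccard coefficient: |intersection| / |union|.
    static double Jaccard(HashSet<string> a, HashSet<string> b)
    {
        int intersection = a.Count(b.Contains);
        int union = a.Count + b.Count - intersection;
        return union == 0 ? 0.0 : (double)intersection / union;
    }

    static void Main()
    {
        double sim = Jaccard(Shingles("A.B. Mann Incorporated"),
                             Shingles("AB Mann Inc."));
        Console.WriteLine($"Jaccard similarity: {sim:F2}");
    }
}
```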

Does anyone know of a more efficient algorithm (faster than O(n^2)), or better yet, a library already written in C#, for accomplishing this?

MgSam
  • Have you benchmarked LINQ at the task you are trying to perform? Do you have example snippets of exactly what you are trying to do? How large are these arrays? – corylulu Nov 07 '12 at 18:04
  • What do you need returned from the algorithm? Do you need to know if [bool valueMatchesInB = arrayB.Contains(arrayA[i])] Or do you need to know exactly which index matches which index? – corylulu Nov 07 '12 at 18:28
  • Basically, for a given string, the algorithm needs to return which members in a set of strings are fuzzy-matches within some tolerance. – MgSam Nov 07 '12 at 18:40
  • **Use the search function, Luke**. I've seen this question multiple times here. Look up suffix trees, too. – Has QUIT--Anony-Mousse Nov 07 '12 at 18:42
  • I did search, and as I say quite clearly in the question, the typical answers always end up being n^2. – MgSam Nov 07 '12 at 18:43
  • I didn't even realize it, but even the article you mentioned suggested moving to parallel processing using OpenMP. But if you're dedicated to C#, your options are more limited. – corylulu Nov 07 '12 at 19:06
  • http://stackoverflow.com/questions/13004248/choosing-an-appropriate-data-structure-hash-table-vs-suffix-tree-for-indexing http://stackoverflow.com/questions/6665398/algorithm-string-similarity-score-hash http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings – Has QUIT--Anony-Mousse Nov 07 '12 at 19:11
  • @MgSam do you know about FASTA and BLAST? – Grijesh Chauhan Dec 13 '12 at 11:33

3 Answers


Not a direct answer to the O(N^2) problem, but a comment on the algorithm used for each single (N = 1) comparison.

That is sample data, but it is all clean, and clean data is not what I would use Levenshtein on. For example, "Incriminate" would have a smaller edit distance to "Incorporated" than "Inc." does, and "E." would not match well to "Enrique".

Levenshtein distance is good at catching key-entry errors. It is also good for matching OCR output.

If you have clean data I would go with stemming and other custom rules. A Porter stemmer is available for C#. With clean data, e.g.:

- remove "." and other punctuation
- remove stop words ("the")
- stem each word
- parse each list once and assign an int value to each unique stem
- do the match on ints

This is still O(N^2), but now each single comparison is faster. You might add a rule where a single capital letter that matches a word starting with that capital gets a partial score. You also need to account for the number of words: two groups of 5 that match on 3 should score higher than two groups of 10 that match on 4.

I would create int hash sets for each phrase and then intersect and count them, as in the sketch below.
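A rough sketch of that idea (the stemming step is omitted; a Porter stemmer would slot in where the comment indicates, and the stop-word list is just an example):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class StemMatchSketch
{
    static readonly char[] Punctuation = { ' ', '.', ',' };
    static readonly HashSet<string> StopWords = new HashSet<string> { "the", "mr" };
    static readonly Dictionary<string, int> StemIds = new Dictionary<string, int>();

    // Normalize a phrase into a set of int ids, one per unique stem.
    static HashSet<int> ToIdSet(string phrase)
    {
        var ids = new HashSet<int>();
        foreach (var word in phrase.ToLowerInvariant()
                                   .Split(Punctuation, StringSplitOptions.RemoveEmptyEntries))
        {
            if (StopWords.Contains(word)) continue;
            var stem = word; // a Porter stemmer would be applied here
            if (!StemIds.TryGetValue(stem, out var id))
                StemIds[stem] = id = StemIds.Count;
            ids.Add(id);
        }
        return ids;
    }

    static void Main()
    {
        var a = ToIdSet("A.B. Mann Incorporated");
        var b = ToIdSet("AB Mann Inc.");

        // Intersect and count; comparing ints is much cheaper than edit distance.
        int common = a.Count(b.Contains);
        // Penalize large groups: matching 3 of 5 should beat matching 4 of 10.
        double score = (double)common / Math.Max(a.Count, b.Count);
        Console.WriteLine($"{common} shared stems, score {score:F2}");
    }
}
```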

I am not sure you can get out of O(N^2), but I am suggesting you look at speeding up each single (N = 1) comparison.

Lucene is a library with phrase matching, but it is not really set up for batches. It creates the index with the intent that it will be used many times, so index search speed is optimized over index creation time.

paparazzo

In the given examples at least one word always matches. A possible approach could use a multimap (a dictionary that can store multiple entries per key), i.e. a Dictionary<TKey, List<TValue>>. Each string from the first set would be split into single words. These words would be used as keys in the multimap, and the whole string would be stored as the value.

Now you can split the strings from the second set into single words and do an O(1) lookup for each word, i.e. an O(N) lookup for all the words. This yields a first raw result, where each match contains at least one matching word. Finally, you would have to refine this raw result by applying other rules (like searching for initials or abbreviated words).
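A minimal sketch of this idea (punctuation handling is simplified, and the refinement step is only hinted at in the final comment):

```csharp
using System;
using System.Collections.Generic;

class MultimapSketch
{
    static readonly char[] Separators = { ' ', '.', ',' };

    static void Main()
    {
        var setA = new[] { "A.B. Mann Incorporated", "Mr. Enrique Bellini", "Park Management Systems" };
        var setB = new[] { "Park", "AB Mann Inc.", "E. Bellini" };

        // Build the multimap: word -> all phrases of the first set containing it.
        var index = new Dictionary<string, List<string>>();
        foreach (var phrase in setA)
            foreach (var word in phrase.ToLowerInvariant()
                                       .Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!index.TryGetValue(word, out var list))
                    index[word] = list = new List<string>();
                list.Add(phrase);
            }

        // O(1) lookup per word of the second set yields the raw candidates.
        foreach (var phrase in setB)
            foreach (var word in phrase.ToLowerInvariant()
                                       .Split(Separators, StringSplitOptions.RemoveEmptyEntries))
                if (index.TryGetValue(word, out var candidates))
                    Console.WriteLine($"\"{phrase}\" -> {string.Join(", ", candidates)} (via \"{word}\")");

        // A real implementation would now refine these candidates
        // (initials, abbreviations, scoring) instead of printing them.
    }
}
```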

Olivier Jacot-Descombes
  • "`In the given examples at least one word is always matching`" Depends on the assumption but I don't think this would work for `fuzzy-string matching`. – L.B Nov 07 '12 at 21:40

This problem, called "string similarity join," has been studied a lot recently in the research community. We released a C++ source code package called Flamingo that implements such an algorithm: http://flamingo.ics.uci.edu/releases/4.1/src/partenum/. We also have a Hadoop-based implementation at http://asterix.ics.uci.edu/fuzzyjoin/ in case your data set is too large for a single machine.

Chen Li