I need to compare a set of strings to another set of strings and find which strings are similar (fuzzy-string matching). For example:
{ "A.B. Mann Incorporated", "Mr. Enrique Bellini", "Park Management Systems" }
and
{ "Park", "AB Mann Inc.", "E. Bellini" }
Assuming a zero-based index, the matches would be 0-1, 1-2, 2-0. Obviously, no algorithm can be perfect at this type of thing.
I have a working implementation of the Levenshtein-distance algorithm, but using it to find similar strings from each set necessitates looping through both sets of strings to do the comparison, resulting in an O(n^2) algorithm. This runs unacceptably slow even with modestly sized sets.
I've also tried a clustering algorithm that uses shingling and the Jaccard coefficient. Unfortunately, this too runs in O(n^2), which ends up being too slow, even with bit-level optimizations.
Does anyone know of a more efficient algorithm (faster than O(n^2)), or better yet, a library already written in C#, for accomplishing this?