Fuzzy String Comparison Algorithm
The custom brute force method below provides word swapping and gives you complete control over the vowel/consonant score thresholds, but increases the total number of comparisons.
You will also want to check methods such as Apache Lucene described in this thread: Fuzzy string search library in Java
Custom Fuzzy Comparison Recipe:
- Lower Case: All comparisons will be with lower-case text. Either make sure that all words in the reference database are in lower case, or use a
String.toLower()
on each item in the database before comparison. Obviously, preprocessing the list in the database will dramatically increase performance.
- Remove Spaces and Punctuation: You must make a function that removes all spaces and other punctuation from any phrase. You should have a separate column in your reference with this information pre-calculated for an increase in performance.
- Custom Compare Function: Your
String
comparison function will compare each character and assign a custom score based on closeness of letters, in which the lowest scores will indicate the best match. For example, identical characters will add zero score. Each mismatched consonant pair will add 2 to the score. Each mismatched vowel will add 1. Mixed mismatches will add 3. Normalize the score by the number of characters. Apply a simple threshold to determine acceptable matches. In the above example, start with threshold=0.2
which will allow approximately one small mistake per 5 characters (this solves simple misspellings, but not missing characters. See Step 4 below).
- Extra or Missing Characters: Loop through each comparison an extra time for each character position. Once without the character in that position and once with an extra character in that position. Report the smallest score for all the loops. Compare that score against the threshold. Break out of the loop and stop comparing if the score is below the threshold, thus indicating a match. This will catch misspellings such as "colage" for "collage".
- Swap Words: After the loop in Step #4, if the score is still above the threshold, loop through each word of the input phrase and swap with its nearest neighbor adjacent word. and rerun the comparison suite. Obviously, you will have to look at the original raw user phrase to find the word boundaries, rather than the processed phrase without spaces and punctuation of Step #2. This will catch your requirement of allowing "Jordan Michael" to substitute for "Michael Jordan".
For long entries with more than 2 words, this method will incur 10's of comparisons per database entry or more, so there is a definite performance hit.