How to accept strings approximately correct as correct, when comparing?

Question

I have a prepopulated sqlite database imported to assets folder and I use it to set some text to my buttons and to compare user's input with my correct answers in that database. But I have two problems which I don't how to solve.

For example I have an answer which is "Michael Jordan" or some other two words. I a user enters Michael Jordan i'm good to go, but if he enter Jordan Michael I'm in trouble. It will popup a wrong answer alert. Is there a way to accept these words shuffles?
Also, if I have an answer "Balls" and user type in "ball" this will be wrong aswer. How to make sure that all singulars and plurals get accepted?

I actually don't have an idea what to try. I was thinking in my head, but nothing came to mind. — marjanbaz, Apr 04 '13 at 22:46
You could try a function that calculates a "similarity" index for two given strings. You may split the candidate in words, and depending on each character that's not identical to the one on the same position in the original, decrease the value-to-be-returned. Accept only strings that share a value X of similarity. — , Apr 04 '13 at 22:58
If someone enters 'Jordan Michael' as the answer to a question that should be 'Michael Jordan' it SHOULD be wrong... :) — boltup_im_coding, Apr 05 '13 at 00:40

score 3 · Answer 1 · edited May 23 '17 at 12:03

Fuzzy String Comparison Algorithm

The custom brute force method below provides word swapping and gives you complete control over the vowel/consonant score thresholds, but increases the total number of comparisons.

You will also want to check methods such as Apache Lucene described in this thread: Fuzzy string search library in Java

Custom Fuzzy Comparison Recipe:

Lower Case: All comparisons will be with lower-case text. Either make sure that all words in the reference database are in lower case, or use a String.toLower() on each item in the database before comparison. Obviously, preprocessing the list in the database will dramatically increase performance.
Remove Spaces and Punctuation: You must make a function that removes all spaces and other punctuation from any phrase. You should have a separate column in your reference with this information pre-calculated for an increase in performance.
Custom Compare Function: Your String comparison function will compare each character and assign a custom score based on closeness of letters, in which the lowest scores will indicate the best match. For example, identical characters will add zero score. Each mismatched consonant pair will add 2 to the score. Each mismatched vowel will add 1. Mixed mismatches will add 3. Normalize the score by the number of characters. Apply a simple threshold to determine acceptable matches. In the above example, start with threshold=0.2 which will allow approximately one small mistake per 5 characters (this solves simple misspellings, but not missing characters. See Step 4 below).
Extra or Missing Characters: Loop through each comparison an extra time for each character position. Once without the character in that position and once with an extra character in that position. Report the smallest score for all the loops. Compare that score against the threshold. Break out of the loop and stop comparing if the score is below the threshold, thus indicating a match. This will catch misspellings such as "colage" for "collage".
Swap Words: After the loop in Step #4, if the score is still above the threshold, loop through each word of the input phrase and swap with its nearest neighbor adjacent word. and rerun the comparison suite. Obviously, you will have to look at the original raw user phrase to find the word boundaries, rather than the processed phrase without spaces and punctuation of Step #2. This will catch your requirement of allowing "Jordan Michael" to substitute for "Michael Jordan".

For long entries with more than 2 words, this method will incur 10's of comparisons per database entry or more, so there is a definite performance hit.

He could also use `str1.equalsIgnoreCase(str2)` instead of converting everything to lowercase. — boltup_im_coding, Apr 05 '13 at 00:38
Yes, he could use `String.equalsIgnoreCase()`, but that would add an order of magnitude or more extra processing onto the algorithm. Each reference string in the database and each iteration of the user phrase would have to be converted to lower case with this function for every single comparison. There will be much better performance if the database is pre-lowered and the user phrase is only lowered once. Similar argument for pre-processing a redundant column in the database without punctuation and white space. — David Manpearl, Apr 05 '13 at 00:42
Oh ok - taking performance into account. I wasn't sure if you just glossed over this method. — boltup_im_coding, Apr 05 '13 at 00:45

score 0 · Answer 2 · answered Apr 04 '13 at 22:50

This is a great question. I think, realistically you need a dictionary of "valid" words. However a dictionary on its own will not solve your problems. You also need a set of heuristics based on your dictionary as to what constitutes a valid entry.

I would be tempted to try "tries" here as you can encapsulate a rich text base better that alternate methods. Tries, in this case will offer comparable performance to say a word dictionary or the likes. The additional benefit of using tries is that it is fairly trivial to add new words/phrases to your application. The downside, tries use a fair amount of memory. That said, there are techniques one can use to compact data.

Well, tries are essentially (in this case) just strings so you probably can. Tries will simply help you identify "correct" words, phrases, how you use the correct data is application dependent. — , Apr 04 '13 at 23:18

How to accept strings approximately correct as correct, when comparing?

2 Answers2

Fuzzy String Comparison Algorithm