I want to measure the similarity between two words. The idea is to read a text with OCR and check the result for keywords. The function I'm looking for should compare two words and return their similarity as a percentage, so comparing a word with itself should yield 100%. I wrote a function of my own that compares the two words character by character and returns the number of matches relative to the word length. But the problem is that
wordComp('h0t','hot')
0.66
wordComp('tackoverflow','stackoverflow')
0
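The function is roughly equivalent to this sketch (a reconstruction: aligning the words position by position and normalizing by the length of the longer word reproduces the numbers above):

def wordComp(word1, word2):
    # Count positions at which both words have the same character
    matches = sum(c1 == c2 for c1, c2 in zip(word1, word2))
    # Normalize by the length of the longer word
    return matches / max(len(word1), len(word2))

A single substitution ('h0t' vs. 'hot') already costs a third of the score, and one missing character at the front ('tackoverflow' vs. 'stackoverflow') shifts every position, so nothing matches at all.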
But intuitively both examples should have a very high similarity, >90%. Adding the Levenshtein distance
import nltk
nltk.edit_distance('word1','word2')
to my function increases the second result to 92%, but the first result is still not good.
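Roughly, I turn the distance into a percentage like this (a sketch; normalizing by the length of the longer word is just one possible choice):

import nltk

def similarity(word1, word2):
    # Invert the normalized Levenshtein distance so that
    # identical words score 1.0 (100%)
    distance = nltk.edit_distance(word1, word2)
    return 1 - distance / max(len(word1), len(word2))

This gives 1 - 1/13 ≈ 92% for 'tackoverflow' vs. 'stackoverflow', but only 1 - 1/3 ≈ 67% for 'h0t' vs. 'hot', because the words are so short that a single edit weighs heavily.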
I already found this solution for R, and it would be possible to use these functions via rpy2, or to use agrepy as another approach. But I want to be able to make the program more or less sensitive by changing the acceptance benchmark (only accept matches with similarity > x%).
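As a sketch of what I mean (the threshold value is hypothetical and would be tuned per use case):

def is_match(ocr_word, keyword, threshold=0.9):
    # Hypothetical acceptance check; similarity() is the
    # Levenshtein-based sketch from above
    return similarity(ocr_word, keyword) >= threshold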
Is there another good measure I could use, or do you have any ideas on how to improve my function?