I want to measure the similarity between two words. The idea is to read a text with OCR and check the result for keywords. The function I'm looking for should compare two words and return their similarity as a percentage, so comparing a word with itself should give 100%. I wrote my own function that compares the words character by character and returns the number of matching positions relative to the length. The problem is that one substituted character costs a short word a lot, and one missing character shifts every later position so that nothing matches:

wordComp('h0t','hot')
0.66
wordComp('tackoverflow','stackoverflow')
0

Intuitively, though, both examples should have a very high similarity (>90%). Adding the Levenshtein distance

import nltk
nltk.edit_distance('word1','word2')

to my function raises the second result to 92%, but the first result is still not good.
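For reference, one common way to turn an edit distance into a percentage is to divide it by the length of the longer word and subtract the result from 1. The following is a minimal sketch under that assumption, not necessarily what the function above does. `levenshtein` and `edit_similarity` are hypothetical names, and the distance is computed in plain Python so the snippet runs without nltk.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(round(edit_similarity('tackoverflow', 'stackoverflow'), 2))  # 0.92
print(round(edit_similarity('h0t', 'hot'), 2))                     # 0.67
```

With this normalisation a single wrong character in a three-letter word can never score above 67%, which is why the short-word case stays low.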

I already found this solution for "R", and it would be possible to use these functions with rpy2, or to use agrepy as another approach. But I want to be able to make the program more or less sensitive by changing the threshold for acceptance (only accept matches with similarity > x%).

Is there another good measure I could use or do you have any ideas to improve my function?

tifi90

2 Answers

You could just use difflib. This function I got from an answer some time ago has served me well:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print (similar('tackoverflow','stackoverflow'))
print (similar('h0t','hot'))

0.96
0.666666666667

You could easily extend the function, or wrap it in another function, to accept only matches above a given similarity by passing the threshold as a third argument:

from difflib import SequenceMatcher

def similar(a, b, threshold):
    # Return the ratio only if it exceeds the threshold, otherwise None.
    sim = SequenceMatcher(None, a, b).ratio()
    if sim > threshold:
        return sim
    return None

print (similar('tackoverflow','stackoverflow', 0.9))
print (similar('h0t','hot', 0.9))

0.96
None
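
For the keyword check described in the question, the standard library also offers `difflib.get_close_matches`, which applies the same ratio together with a `cutoff`. Below is a rough sketch of how that could look. `find_keywords` and the sample word lists are made up for illustration and are not part of the original answer.

```python
from difflib import get_close_matches


def find_keywords(ocr_words, keywords, cutoff=0.9):
    """Map each OCR word to its best-matching keyword, if any passes the cutoff."""
    hits = {}
    for word in ocr_words:
        # n=1 keeps only the best candidate; cutoff is the acceptance threshold
        match = get_close_matches(word.lower(), keywords, n=1, cutoff=cutoff)
        if match:
            hits[word] = match[0]
    return hits


ocr_words = ['tackoverflow', 'h0t', 'python']   # hypothetical OCR output
keywords = ['stackoverflow', 'hot']             # hypothetical keyword list

print(find_keywords(ocr_words, keywords, cutoff=0.9))
print(find_keywords(ocr_words, keywords, cutoff=0.6))
```

Lowering `cutoff` makes the check less strict: at 0.9 only 'tackoverflow' is accepted, while at 0.6 'h0t' is matched to 'hot' as well.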
ragamuffin
  • Thank you for the idea. This helps me with the first problem, but the problem with short words is still unanswered. Any other ideas on that? – tifi90 Nov 29 '18 at 12:41
  • I'm not quite sure why you want a higher value for the three-letter word. You say that intuitively you expected a higher similarity. Strictly speaking, one of the three characters differs between the strings, which makes them 66% similar. Can you elaborate on what your expected outcome should be and why? – ragamuffin Nov 29 '18 at 14:14
  • I don't know what the exact outcome should be. What makes me think of a higher score is that if you compare h0t and hxt, then intuitively h0t is closer to hot than hxt is, since 0 and o look nearly the same. Just imagine this were handwritten: you wouldn't really mark h0t as wrong, but hxt clearly is. – tifi90 Nov 29 '18 at 14:27
  • Well yes, they are aesthetically similar, but I don't know of any way to test for that. That is quite subjective as well, isn't it? For all intents and purposes x, o and 0 are equally dissimilar to one another. – ragamuffin Nov 29 '18 at 15:22
  • I just thought of the following "quick and dirty" fix: map digits to letters with a fixed mapping (0->o, 5->s, 3->E, 9->g, ...). Since I'm searching for real words, a zero or a five or any other digit should never be part of the keyword. – tifi90 Nov 29 '18 at 15:35
  • Yes, that could work. Just out of curiosity: would you adjust for that in the similarity ratio by a factor, or would you just take a 5 for an s, a 9 for a g, etc.? – ragamuffin Nov 29 '18 at 17:08
  • I just take the digits and map them to the characters. It works surprisingly well. I've added some more lines of code and other rules, for example a penalty term of -0.1 on the ratio for each mapped digit, or mapping a capital "i" in the middle of a word to "L", and so on... I'll share my result when all rules are implemented. – tifi90 Nov 29 '18 at 19:56
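
The digit-mapping idea from the comments above could be sketched roughly as follows. This is only an illustration, not tifi90's actual code (which was not posted). The table holds the four pairs mentioned in the comments, lowercased, and `normalise` and `ocr_similarity` are hypothetical names.

```python
from difflib import SequenceMatcher

# Look-alike pairs taken from the comment thread (0->o, 5->s, 3->E, 9->g),
# lowercased here because both words are compared in lowercase.
DIGIT_TO_LETTER = {'0': 'o', '5': 's', '3': 'e', '9': 'g'}


def normalise(word):
    """Replace look-alike digits with letters; also return how many were replaced."""
    mapped = ''.join(DIGIT_TO_LETTER.get(ch, ch) for ch in word)
    replaced = sum(1 for ch in word if ch in DIGIT_TO_LETTER)
    return mapped, replaced


def ocr_similarity(candidate, keyword, penalty=0.1):
    """difflib ratio after digit mapping, minus a penalty per mapped digit."""
    cleaned, replaced = normalise(candidate.lower())
    ratio = SequenceMatcher(None, cleaned, keyword.lower()).ratio()
    return max(0.0, ratio - penalty * replaced)  # never drop below zero


print(round(ocr_similarity('h0t', 'hot'), 2))  # 0.9
print(round(ocr_similarity('hxt', 'hot'), 2))  # 0.67
```

The penalty keeps 'h0t' slightly below an exact match while still ranking it well above 'hxt'.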

I wrote the following code; try it. I defined str3 for the cases where the two strings being compared (str1 and str2) differ in length: it holds the longer string truncated to the length of the shorter one. The code runs in a while loop; to exit, enter -1 for k.

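k = 1
while k != -1:
    str1 = input()
    str2 = input()
    k = int(input())  # enter -1 to exit after this comparison

    cnt = 0  # reset the match counter for every new pair
    if len(str1) > len(str2):
        # truncate the longer string to the length of the shorter one
        str3 = str1[0:len(str2)]
        for j in range(len(str3)):
            if str3[j] == str2[j]:
                cnt += 1
        print(cnt / len(str1) * 100)

    elif len(str1) < len(str2):
        str3 = str2[0:len(str1)]
        # loop over the truncated length; len(str2) would raise an IndexError
        for j in range(len(str3)):
            if str3[j] == str1[j]:
                cnt += 1
        print(cnt / len(str2) * 100)

    else:
        for j in range(len(str2)):
            if str2[j] == str1[j]:
                cnt += 1
        # two empty strings count as identical (avoids division by zero)
        print(cnt / len(str1) * 100 if str1 else 100.0)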
MH.AI.eAgLe
  • Thanks for sharing your code. This looks like what I tried in the first place, and this function gives results similar to mine. The main problem I see is that you lose a lot of information when you cut the string with `str3=str2[0:len(str1)]`. – tifi90 Nov 29 '18 at 12:52