I have a lot of strings that I want to match for similarity (each string is 30 characters on average). I found difflib's SequenceMatcher great for this task, as it was simple and gave good results. But if I compare hellboy and hell-boy like this:
>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher(lambda x: x == '-', 'hellboy', 'hell-boy')
>>> sm.ratio()
0.93333333333333335
I want such words to give a 100 percent match, i.e. a ratio of 1.0. I understand that the junk characters specified in the call above only affect how the longest contiguous matching subsequence is found, and that they still count toward the ratio. Is there some way I can make SequenceMatcher ignore some "junk" characters for comparison purposes?
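
For reference, the only workaround I can think of is to strip the unwanted characters from both strings before handing them to SequenceMatcher. Below is a minimal sketch of that idea, assuming Python 3; the helper name ratio_ignoring and its junk parameter are made up for illustration and are not part of difflib.

```python
from difflib import SequenceMatcher


def ratio_ignoring(a, b, junk="-"):
    """Return the similarity ratio of a and b after deleting every
    character listed in `junk` from both strings.

    Hypothetical helper: this preprocesses the inputs instead of
    relying on SequenceMatcher's isjunk argument.
    """
    # Translation table that maps each junk character to None (deletion).
    table = str.maketrans("", "", junk)
    return SequenceMatcher(None, a.translate(table), b.translate(table)).ratio()


print(ratio_ignoring("hellboy", "hell-boy"))  # 1.0
```

This gives 1.0 for the pair above because both strings become hellboy before the comparison, but it changes the inputs rather than the matcher, so I would still like to know whether SequenceMatcher can do this itself.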