making difflib's SequenceMatcher ignore "junk" characters

Question

I have a lot of strings that I want to compare for similarity (each string is about 30 characters long on average). difflib's SequenceMatcher works well for this task: it is simple and the results are good. But if I compare hellboy and hell-boy like this

>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher(lambda x: x == '-', 'hellboy', 'hell-boy')
>>> sm.ratio()
0.93333333333333335

I want such words to give a 100 percent match, i.e. a ratio of 1.0. I understand that the junk characters specified in the function above are used only when finding the longest contiguous matching subsequence, not to exclude characters from the comparison. Is there some way I can make SequenceMatcher ignore certain "junk" characters for comparison purposes?
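To illustrate the behaviour described above, here is a minimal sketch (the variable names are only for illustration). It compares the same pair with and without an isjunk function. For this input the two ratios come out the same, because the 7 matched characters are divided into the full combined length of 15 either way:

```python
from difflib import SequenceMatcher

a, b = "hellboy", "hell-boy"

# Ratio with '-' declared as junk via the isjunk argument.
with_junk = SequenceMatcher(lambda ch: ch == "-", a, b).ratio()

# Ratio with no junk function at all.
without_junk = SequenceMatcher(None, a, b).ratio()

# ratio() is 2 * matches / (len(a) + len(b)); here 2 * 7 / 15 in both cases.
print(with_junk, without_junk)
```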

It's kind of hackish, but is there any reason you couldn't just remove the _junk_ characters before doing the comparison? It's essentially the same thing as ignoring them. – Gareth Latty, Apr 02 '12 at 20:58
Yes, that's good, but I wanted to figure out whether I could just do some `difflib` magic and get away with it; otherwise I would have to pass each string through another function first to remove all the junk chars. – lovesh, Apr 02 '12 at 21:09

score 4 · Accepted Answer · edited May 23 '17 at 12:17

If you wish to do as I suggested in the comments (removing the junk characters), the fastest method is to use str.translate().

E.g:

# Python 2 str only: the second argument (deletechars) must be a string of characters to delete.
to_compare = to_compare.translate(None, "-")

As shown here, this is significantly (3x) faster than a regex (and, I feel, nicer to read).

Note that under Python 3.x, or if you are using Unicode strings under Python 2.x, this will not work, as the deletechars parameter is not accepted. In this case, you simply need to map each character to None. E.g.:

# Python 3. For Python 2 unicode strings, use {ord(u"-"): None} as the map instead.
translation_map = str.maketrans({"-": None})
to_compare = to_compare.translate(translation_map)

You could also write a small function to save some typing if you have a lot of characters to remove; just make a set and pass it through:

def to_translation_map(iterable):
    # translate() looks characters up by code point, so the keys must be ordinals.
    return {ord(key): None for key in iterable}
    #return dict((ord(key), None) for key in iterable) #For old versions of Python without dict comps.
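Putting the pieces together, here is a minimal sketch for Python 3 that strips the junk characters from both strings and only then hands them to SequenceMatcher. The names JUNK and similarity are made up for this example, and the set of junk characters is an assumption you would adapt to your data:

```python
from difflib import SequenceMatcher

# Hypothetical set of junk characters; adjust to your data.
JUNK = "-_ "

# Three-argument str.maketrans: every character in the third argument maps to None.
_DELETE_JUNK = str.maketrans("", "", JUNK)


def similarity(a, b):
    """Return the SequenceMatcher ratio of a and b with junk characters removed."""
    a_clean = a.translate(_DELETE_JUNK)
    b_clean = b.translate(_DELETE_JUNK)
    return SequenceMatcher(None, a_clean, b_clean).ratio()


print(similarity("hellboy", "hell-boy"))  # 1.0
```

Because both strings are cleaned before matching, the removed characters no longer count towards the combined length that ratio() divides by.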

score 1 · Answer 2 · edited Dec 07 '13 at 16:42

If you were to write a function to remove all the junk characters beforehand, you could use re:

import re
string = re.sub(r'-|_|\*', '', string)

For the regular expression '-|_|\*', just put a | between all the junk characters, and if one of them is a special re character, put a \ before it (like * and +).
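If you would rather not escape the special characters by hand, one possible variation (not part of the answer above) is to let re.escape do it and wrap the result in a character class. This is only a sketch; junk_chars, junk_pattern and strip_junk are names invented for the example:

```python
import re

# Hypothetical junk characters; re.escape adds backslashes where they are needed.
junk_chars = "-_*"

# Build a character class such as [\-_\*] and compile it once for reuse.
junk_pattern = re.compile("[" + re.escape(junk_chars) + "]")


def strip_junk(text):
    """Remove every occurrence of a junk character from text."""
    return junk_pattern.sub("", text)


print(strip_junk("hell-bo_y*"))  # hellboy
```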

answered Apr 03 '12 at 00:39 by apple16 · edited Dec 07 '13 at 16:42 by BenMorel

Is `-|_|\*` better than using `[-_*]` or are they equal efficiency wise? – Sam Rockett Dec 18 '18 at 10:25
