Python regular expression for strings similarity comparison

Question

I found that SequenceMatcher from the difflib library can return a similarity score between two strings. However, one of its arguments, isjunk, is a little tricky to deal with, especially with regular expressions.

Take two strings for example:

a = 'Carrot 500g'
b = 'Cabbage 500g'

from difflib import SequenceMatcher
import re

def similar_0(a, b):
    return SequenceMatcher(None, a, b).ratio()

similar_0(a, b)

def similar_1(a, b):
    return SequenceMatcher(lambda x: bool(re.search(r'\b(\d)+([a-zA-Z])+\b', x)), a, b).ratio()

similar_1(a, b)

When comparing these two strings, I want to ignore unit information such as "500g" above. But similar_0 and similar_1 return the same result. I'm confused about how the isjunk argument works in SequenceMatcher. What is the correct way to achieve this, or are there any alternatives?

Possible duplicate of [this](https://stackoverflow.com/questions/38129357/difflib-sequencematcher-isjunk-argument-not-considered) — Ketan Mukadam, Aug 24 '17 at 10:03
Having looked at that post, I'm even more confused because my question has something to do with regexp. Appreciate it very much if you could provide a simpler explanation. — James Wong, Aug 24 '17 at 10:16
@JamesWong Do you want a regex to remove any weight (500g, 100g, 2kg) from your string? If so, have you got more examples? — Mr Mystery Guest, Aug 24 '17 at 10:20
I doubt this question has much to do with regex. It is more about how the string is parsed when passed to SequenceMatcher. The regex itself is great and working - matches `500g`. — Wiktor Stribiżew, Aug 24 '17 at 10:20
@MrMysteryGuest Exactly, I want to filter out any unit like (500g, 100g, 2kg) as well as (500ml, 1lb) etc. I think the regexp works but the question is how does it work with `SequenceMatcher`. — James Wong, Aug 24 '17 at 10:25
Have you tried removing any weight extension before passing them to `SequenceMatcher` - Carrot vs Cabbage? Instead of a = Carrot 500g vs Cabbage 500g — Mr Mystery Guest, Aug 24 '17 at 10:35
That didn't even cross my mind actually. And I think @Rawing just suggested that same idea as an answer. Thank you anyway. — James Wong, Aug 24 '17 at 10:42

score 4 · Answer 1 · answered Aug 24 '17 at 10:31

Your regex doesn't work because SequenceMatcher passes individual characters to the isjunk function, not words:

>>> SequenceMatcher(print, 'Carrot 500g', 'Cabbage 500g')
b
0
5
a
e

g
C
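A minimal sketch of why the pattern never fires, assuming CPython's difflib behaviour of calling isjunk once per distinct element of the second sequence. The names `record_isjunk`, `seen`, and `UNIT_PATTERN` are hypothetical and used only for illustration:

```python
import re
from difflib import SequenceMatcher

# The regex from the question, compiled once.
UNIT_PATTERN = re.compile(r'\b(\d)+([a-zA-Z])+\b')

seen = []  # hypothetical helper list: every argument isjunk receives


def record_isjunk(element):
    """Record the element, then apply the question's regex to it."""
    seen.append(element)
    return bool(UNIT_PATTERN.search(element))


# Constructing the matcher is enough to trigger the isjunk calls.
SequenceMatcher(record_isjunk, 'Carrot 500g', 'Cabbage 500g')

# Each call receives a single character, and a single character can never
# match "digits followed by letters", so nothing is treated as junk.
print(sorted(seen))
print(any(UNIT_PATTERN.search(ch) for ch in seen))
```

Because no element is ever flagged as junk, the matcher behaves as if isjunk were None, which is consistent with similar_0 and similar_1 returning the same score.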

You should just remove the junk from both strings before passing them to SequenceMatcher:

a = re.sub(r'\b(\d)+([a-zA-Z])+\b', '', a)
b = re.sub(r'\b(\d)+([a-zA-Z])+\b', '', b)
print(similar_0(a, b))
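The same idea can be wrapped in a small helper. This is only a sketch under stated assumptions: `strip_units` and `similar_without_units` are hypothetical names, the pattern is a simplified equivalent of the one above, and it only covers units written as digits immediately followed by letters (500g, 2kg, 500ml, 1lb). It also collapses the whitespace that the substitution leaves behind, so a trailing space does not influence the score:

```python
import re
from difflib import SequenceMatcher

# Digits immediately followed by letters, as a whole word (e.g. "500g").
UNIT_RE = re.compile(r'\b\d+[a-zA-Z]+\b')


def strip_units(text):
    """Remove unit tokens, then collapse the leftover whitespace."""
    without_units = UNIT_RE.sub('', text)
    return ' '.join(without_units.split())


def similar_without_units(a, b):
    """Similarity ratio of the two strings after removing unit tokens."""
    return SequenceMatcher(None, strip_units(a), strip_units(b)).ratio()


a = 'Carrot 500g'
b = 'Cabbage 500g'

print(strip_units(a), '|', strip_units(b))
print(SequenceMatcher(None, a, b).ratio())   # score with the shared "500g"
print(similar_without_units(a, b))           # score on the product names only
```

With the shared unit removed, the score reflects only the product names, so it should come out lower than the raw comparison for this pair.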

Aran-Fey

Oh, great. I didn't know that. I think your solution is decent enough. Thanks. – James Wong Aug 24 '17 at 10:37
