
I found that SequenceMatcher from the difflib library can return a similarity score between two strings. However, one of its arguments, isjunk, is a little tricky to deal with, especially with regular expressions.

Take two strings for example:

a = 'Carrot 500g'
b = 'Cabbage 500g'

from difflib import SequenceMatcher
import re

def similar_0(a, b):
    return SequenceMatcher(None, a, b).ratio()

similar_0(a, b)

def similar_1(a, b):
    return SequenceMatcher(lambda x: bool(re.search(r'\b(\d)+([a-zA-Z])+\b', x)), a, b).ratio()

similar_1(a, b)

When comparing these two strings, I want to ignore all unit information such as "500g" above. But I get the same result from similar_0 and similar_1. I'm really confused about how the isjunk argument of SequenceMatcher works. What is the correct way to achieve this, or are there any alternatives?

James Wong
  • Possible duplicate of [this](https://stackoverflow.com/questions/38129357/difflib-sequencematcher-isjunk-argument-not-considered) – Ketan Mukadam Aug 24 '17 at 10:03
  • Having looked at that post, I'm even more confused because my question has something to do with regexp. Appreciate it very much if you could provide a simpler explanation. – James Wong Aug 24 '17 at 10:16
  • @JamesWong Do you want a regex to remove any weight (500g, 100g, 2kg) from your string? If so, have you got more examples? – Mr Mystery Guest Aug 24 '17 at 10:20
  • I doubt this question has much to do with regex. It is more about how the string is parsed when passed to SequenceMatcher. The regex itself is great and working - matches `500g`. – Wiktor Stribiżew Aug 24 '17 at 10:20
  • @MrMysteryGuest Exactly, I want to filter out any unit like (500g, 100g, 2kg) as well as (500ml, 1lb) etc. I think the regexp works but the question is how does it work with `SequenceMatcher`. – James Wong Aug 24 '17 at 10:25
  • @WiktorStribiżew Yes. Any thoughts? – James Wong Aug 24 '17 at 10:26
  • Have you tried removing any weight extension before passing them to `SequenceMatcher` - Carrot vs Cabbage? Instead of a = Carrot 500g vs Cabbage 500g – Mr Mystery Guest Aug 24 '17 at 10:35
  • That didn't even cross my mind actually. And I think @Rawing just suggested that same idea as an answer. Thank you anyway. – James Wong Aug 24 '17 at 10:42

1 Answer


Your regex doesn't work because SequenceMatcher passes individual characters to the isjunk function, not words:

>>> SequenceMatcher(print, 'Carrot 500g', 'Cabbage 500g')
b
0
5
a
e

g
C
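
The same behaviour can be checked without relying on printed output. The sketch below uses a hypothetical helper, `record`, that stores every argument it receives and always reports "not junk". In CPython the callback is invoked while the matcher is being constructed, once per distinct element of the second sequence; the exact order and number of calls are implementation details and should not be relied on.

```python
from difflib import SequenceMatcher

seen = []  # every argument that SequenceMatcher hands to the callback

def record(element):
    # Hypothetical helper for illustration: remember the argument and
    # say "not junk" so that matching behaves as if isjunk were None.
    seen.append(element)
    return False

SequenceMatcher(record, 'Carrot 500g', 'Cabbage 500g')

# Each argument is a single character, so a word-level regex such as
# r'\b\d+[a-zA-Z]+\b' can never match any of them.
print(all(len(element) == 1 for element in seen))
print(sorted(set(seen)))
```

Because the callback only ever sees one character at a time, a pattern that needs a digit followed by a letter has nothing to match against, which is why similar_0 and similar_1 return the same score.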

You should just remove the junk from both strings before passing them to SequenceMatcher:

a = re.sub(r'\b(\d)+([a-zA-Z])+\b', '', a)
b = re.sub(r'\b(\d)+([a-zA-Z])+\b', '', b)
print(similar_0(a, b))
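
Putting the pieces together, here is a minimal sketch of that approach as reusable functions. The names `UNIT_PATTERN`, `strip_units`, and `similarity` are made up for this example, and the whitespace clean-up is an extra step that is not part of the snippet above: removing "500g" leaves a trailing space in both strings, which would otherwise count as a matching character.

```python
import re
from difflib import SequenceMatcher

# Assumed pattern: digits immediately followed by letters (500g, 2kg, 500ml, 1lb).
UNIT_PATTERN = re.compile(r'\b\d+[a-zA-Z]+\b')

def strip_units(text):
    """Remove unit tokens and collapse the whitespace left behind."""
    without_units = UNIT_PATTERN.sub('', text)
    return ' '.join(without_units.split())

def similarity(a, b):
    """Compare two strings after removing unit tokens from both."""
    return SequenceMatcher(None, strip_units(a), strip_units(b)).ratio()

print(strip_units('Carrot 500g'))
print(similarity('Carrot 500g', 'Cabbage 500g'))
```

Note that the pattern does not cover quantities written with a space ("500 g") or with decimals ("1.5kg"); it would need to be extended if those forms appear in the data.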
Aran-Fey