I found that SequenceMatcher from the difflib library can return a similarity score between two strings. However, one of its arguments, isjunk, is a little tricky to deal with, especially with regular expressions.
Take two strings, for example:
a = 'Carrot 500g'
b = 'Cabbage 500g'
from difflib import SequenceMatcher
import re
def similar_0(a, b):
    return SequenceMatcher(None, a, b).ratio()
similar_0(a, b)
def similar_1(a, b):
    return SequenceMatcher(lambda x: bool(re.search(r'\b(\d)+([a-zA-Z])+\b', x)), a, b).ratio()
similar_1(a, b)
When comparing these two strings, I want to ignore all unit information, such as "500g" above. But I get the same result from similar_0 and similar_1. I'm really confused about how the isjunk argument of SequenceMatcher works. What is the correct way to achieve this, or are there any alternatives?
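
For reference, a minimal sketch of one possible alternative, not necessarily the best one. It relies on the documented behaviour that isjunk is called with a single sequence element, which for strings is one character, so a pattern that needs digits followed by letters can never match inside it. The sketch strips the unit tokens with re.sub before comparing instead. The names UNIT_TOKEN, strip_units, similar_without_units and spy are made up for illustration, and the pattern assumes a unit always looks like digits immediately followed by letters.

```python
import re
from difflib import SequenceMatcher

# Assumed shape of a unit token such as "500g" or "2kg": digits, then letters.
UNIT_TOKEN = re.compile(r'\b\d+[a-zA-Z]+\b')

def strip_units(text):
    """Remove unit tokens and collapse any leftover whitespace."""
    return ' '.join(UNIT_TOKEN.sub(' ', text).split())

def similar_without_units(a, b):
    """Similarity ratio computed after unit tokens are removed from both strings."""
    return SequenceMatcher(None, strip_units(a), strip_units(b)).ratio()

# Record what isjunk actually receives: one element (character) of b at a time.
seen = []

def spy(x):
    seen.append(x)
    return False  # treat nothing as junk; we only want to observe the arguments

SequenceMatcher(spy, 'Carrot 500g', 'Cabbage 500g')

print(seen)                                   # single characters only
print(strip_units('Carrot 500g'))             # Carrot
print(similar_without_units('Carrot 500g', 'Cabbage 500g'))
```

Even with a per-character isjunk (for example one that flags digits), the junk characters still count toward the total length used by ratio(), so isjunk on its own probably won't make the units disappear from the score. Preprocessing the strings sidesteps that.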