I have two columns of disease names, and for each name in the first column I need to find the best match in the second. I tried the "SequenceMatcher" class from difflib and the "fuzzywuzzy" module in Python, and the results were surprising. The results and my questions are below.
Suppose I have the disease "liver neoplasms" and need to pick the better match between "cancer, liver" and "cancer, breast". Since "liver" appears in both "liver neoplasms" and "cancer, liver", I would expect "cancer, liver" to win easily, but it does not. I would like to know why, and whether there is a better way to do this kind of matching in Python.
from difflib import SequenceMatcher

s1 = 'liver neoplasms'

# Candidate 1: shares the whole word "liver" with s1
s2 = 'cancer, liver'
print(SequenceMatcher(None, s1, s2).ratio())
# Output: 0.3571

# Candidate 2: shares no whole word with s1
s2 = 'cancer, breast'
print(SequenceMatcher(None, s1, s2).ratio())
# Output: 0.4137

# fuzz.ratio from fuzzywuzzy also gives the same results.
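For reference, difflib documents `ratio()` as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. A small sketch for inspecting which character blocks are actually being matched (the helper name `matching_blocks` is made up for illustration):

```python
from difflib import SequenceMatcher

def matching_blocks(a, b):
    """Return (text, index_in_a, index_in_b) for each non-empty matched block."""
    sm = SequenceMatcher(None, a, b)
    return [(a[m.a:m.a + m.size], m.a, m.b)
            for m in sm.get_matching_blocks() if m.size > 0]

s1 = 'liver neoplasms'
for s2 in ('cancer, liver', 'cancer, breast'):
    blocks = matching_blocks(s1, s2)
    matched = sum(len(text) for text, _, _ in blocks)  # M
    total = len(s1) + len(s2)                          # T
    print(s2, blocks, matched, total, round(2 * matched / total, 4))
```

With these strings, "cancer, liver" gets a single block, "liver" (5 characters out of 28), and nothing after it can match because blocks must appear in the same order in both strings. "cancer, breast" collects 6 characters out of 29 from short scattered fragments, which is enough to score slightly higher.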
My question is: how can "cancer, breast" be a better match than "cancer, liver"? What other technique can I use to get this matching done properly?
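One direction that might work is comparing whole words instead of characters, so that word order and punctuation stop mattering. The following is only a minimal sketch using the standard library; the helpers `tokens`, `token_overlap` and `best_match` are hypothetical names, and the score is plain Jaccard overlap of word sets, not anything fuzzywuzzy does internally:

```python
import re

def tokens(name):
    """Lowercase the name and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def token_overlap(a, b):
    """Jaccard similarity of the two word sets: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_match(query, candidates):
    """Return the candidate with the highest word overlap (first one wins ties)."""
    return max(candidates, key=lambda c: token_overlap(query, c))

print(best_match('liver neoplasms', ['cancer, liver', 'cancer, breast']))
```

Here "cancer, liver" shares one of three distinct words with "liver neoplasms", while "cancer, breast" shares none, so the first is chosen. This does not know that "neoplasms" and "cancer" are related terms; that would need a synonym list or similar. fuzzywuzzy also provides `fuzz.token_sort_ratio` and `fuzz.token_set_ratio`, which compare strings in a word-order-insensitive way and may be worth trying.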
Thank you :)