2

I have 2 columns of disease names, I have to try and match the best options. I tried using "SequenceMatcher" module and "fuzzywuzzy" module in python and the results were surprising. I have pasted the results and my doubts below:

Consider there is a disease "liver neoplasms" which I need to match to the best matching name "cancer, liver" or "cancer, breast". Now it's obvious that since liver is a matching word, it should easily pick up "cancer, liver" to be the answer but that isn't happening. I would like to know the reason and a better way to match in python.

from difflib import SequenceMatcher

s1 = 'liver neoplasms'
s2 = 'cancer, liver'

SequenceMatcher(None, s1, s2).ratio() 
# Answer = 0.3571

s2 = 'cancer, breast'
SequenceMatcher(None, s1, s2).ratio()
# Answer = 0.4137 

# fuzzy.ratio also has the same results.

My doubt is how does cancer, breast be more matching than cancer, liver. Which other technique can I use to get this done properly?

Thank you :)

Pynchia
  • 10,996
  • 5
  • 34
  • 43
  • 1
    This answer may help https://stackoverflow.com/a/17388687/548562 – Iain Shelvington Dec 23 '19 at 03:42
  • There are quite a few algorithms to do string matching. Have a look here: https://en.wikipedia.org/wiki/Levenshtein_distance – k88 Dec 23 '19 at 03:47
  • So, you want to get a percentage based on semantics? if so, try reading about word vectors. If you have any kind of data.. use it to retrain those word vectors. That might help – rawwar Dec 23 '19 at 04:19
  • This is too vague and broad, and probably off-topic. – AMC Dec 23 '19 at 04:48

4 Answers4

3

These types of matchers have no semantic understanding. They simply count how many characters match. Some are more sophisticated than others.

The levenshtein distance might help. See https://github.com/ztane/python-Levenshtein.

from difflib import SequenceMatcher from Levenshtein import distance

s1 = 'liver neoplasms' s2 = 'cancer, liver'

print('Sequence-matcher: ', SequenceMatcher(None, s1, s2).ratio()) 
# Answer = 0.35...

print('Levenshtein: ', distance(s1, s2))
# Answer = 13

s2 = 'cancer, breast' 

print('Sequence-matcher: ', SequenceMatcher(None, s1, s2).ratio()) 
# Answer = 0.41...

print('Levenshtein: ', distance(s1, s2))
# Answer = 12
Tammo Heeren
  • 1,966
  • 3
  • 15
  • 20
2

here is cosine similarity matching algorithm between two strings.

Please refer to the following link for explanation of the theory

https://blog.nishtahir.com/2015/09/19/fuzzy-string-matching-using-cosine-similarity/

import re
import math
from collections import Counter


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    word = re.compile(r'\w+')
    words = word.findall(text)
    return Counter(words)


def get_result(content_a, content_b):
    text1 = content_a
    text2 = content_b

    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)

    cosine_result = get_cosine(vector1, vector2)
    return cosine_result


print(get_result('liver neoplasms', 'cancer, liver'))
print(get_result('liver neoplasms', 'cancer, breast'))
joonghyup cha
  • 624
  • 3
  • 12
1

It seems both difflib.SequenceMatcher and fuzzywuzzy use the same mechanism for determining similarity. Namely, the Levenshtein Distance, which can be effectively summarized as "the number of modifications required to turn one string into the other".

Here, according to this calculator, the Levenshtein Distance between 'liver neoplasms' and 'cancer, liver' is 13. Meanwhile, the distance between 'liver neoplasms' and 'cancer, breast' is 12 - slightly smaller.

Levenshtein distance does not seem to be the ideal solution to this problem.


In your case, I would instead try to use some form of keyword matching. I'm not well-read in proper techniques for doing so, but my instinct would be to split the input into keywords and the possible outputs into keywords:

input_keywords = 'liver neoplasms'.split()
possibility_keywords = {title: title.split(', ') for title in ('cancer, breast', 'cancer, liver')}

and then do some sort of weighted matching (whichever set of possibility keywords is closest to the input's set of keywords - you might have to get creative to figure out efficient ways to calculate this) or keyword detection. For example:

def ratio(input_keywords, possibility_keywords):
    return sum(
        min(
            SequenceMatcher(None, inp_kw, poss_kw).ratio() for poss_kw in possibility_keywords
        ) for inp_kw in input_keywords
     )

A quick cursory glance found this paper which might be relevant. Or the cosine similarity algorithm mentioned by the other answer.

Green Cloak Guy
  • 23,793
  • 4
  • 33
  • 53
0

We need to use semantic similarity algorithms here as Neoplasm and Cancer are similar terms however if we go for a Levenshtein Distance or Keyword based matching, it will have poor match or no match at all.

Train the word2vec model on corpus of such terminologies and use this Model for obtaining Word Vectors. At this stage we can use cosine similarity, softcosine similarity etc to create the similarity index from Word Vectors and get the similarity between two semantically matching words.

Reference: Link

Arun
  • 421
  • 3
  • 6