I have data like this:

name     name in another column 
-------------------------------
raju      vasu
ramana    seshu
seshu     ramana

I want to calculate the similarity between these two columns, row by row. For example, the first row would compare raju with vasu.

I want to end up with a similarity score for each row, like this:

name     name in another column  similarity
-------------------------------------------
raju     vasu                    0.1
ramana   seshu                   0.2
seshu    ramana                  0
emilanov

3 Answers

0

This post probably answers your question.

A short example:

from difflib import SequenceMatcher

names_a = ["raju", "ramana", "seshu"]
names_b = ["vasu", "seshu", "ramana"]
# Compare the two columns pairwise: one ratio (0.0 to 1.0) per row
similar = [SequenceMatcher(None, a, b).ratio() for a, b in zip(names_a, names_b)]

The output:

In [7]: similar
Out[7]: [0.5, 0.0, 0.0]
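
If you want the score attached to each row, as in the table from the question, something like the sketch below might work. It uses only the standard library; the names `rows`, `row_similarity` and `scored` are made up for this illustration, and it assumes the two columns are already available as pairs of strings.

```python
from difflib import SequenceMatcher

# Each tuple is one row: (name, name in another column)
rows = [("raju", "vasu"), ("ramana", "seshu"), ("seshu", "ramana")]


def row_similarity(a, b):
    """Return the SequenceMatcher ratio (0.0 to 1.0) for two strings."""
    return SequenceMatcher(None, a, b).ratio()


# Keep both names next to their score so the result reads like a table
scored = [(a, b, round(row_similarity(a, b), 2)) for a, b in rows]

for a, b, score in scored:
    print(f"{a:<8} {b:<8} {score}")
```

Keep in mind that the scores come from SequenceMatcher's own matching rules, so they will not necessarily equal the example values in the question.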
emilanov
0

The fuzzywuzzy module can be used for string matching, e.g.:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
    97
>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

For more details please visit https://pypi.org/project/fuzzywuzzy/

JON
0

fuzzywuzzy does what you want, but it is very slow if your dataset has a lot of rows.

I would use a vectorizer from sklearn (e.g. TfidfVectorizer) to transform the strings into vectors, then pass them to cosine_similarity (also from sklearn).
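
A rough sketch of that idea, using the names from the question. The character n-gram settings (`analyzer="char"`, `ngram_range=(1, 2)`) are my own assumption, not something stated above: the default word-level tokenizer would treat each single-word name as one token, so two different names would always score 0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names_a = ["raju", "ramana", "seshu"]
names_b = ["vasu", "seshu", "ramana"]

# Assumption: character 1- and 2-grams, since each name is a single word
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))

# Fit on both columns so they share one vocabulary
vectorizer.fit(names_a + names_b)
vec_a = vectorizer.transform(names_a)
vec_b = vectorizer.transform(names_b)

# cosine_similarity returns every a-vs-b pair; the diagonal holds the
# row-by-row scores (row i of names_a against row i of names_b)
pairwise = cosine_similarity(vec_a, vec_b)
row_scores = pairwise.diagonal()

for a, b, score in zip(names_a, names_b, row_scores):
    print(f"{a:<8} {b:<8} {score:.2f}")
```

Note that `cosine_similarity` builds the full n x n matrix, which is wasteful if you only need the diagonal. For a large dataset you would probably want to compute the row-wise scores directly or work in chunks.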

AdForte