Finding Similarity between strings in columns of a data frame

Question

I have data like this

name      name in another column
--------------------------------
raju      vasu
ramana    seshu
seshu     ramana

I want to calculate the similarity between these two columns row by row, for example the similarity between raju and vasu.

The result should contain a similarity score for each row, like this:

name     name in another column  similarity
-------------------------------------------
raju     vasu                    0.1
ramana   seshu                   0.2
seshu    ramana                  0

score 0 · Answer 1 · answered Jun 21 '19 at 10:47

This post probably answers your question.

Short example code

from difflib import SequenceMatcher

names_a = ["raju", "ramana", "seshu"]
names_b = ["vasu", "seshu", "ramana"]
# Compare the strings pairwise; ratio() returns a float between 0 and 1
similar = [SequenceMatcher(None, a, b).ratio() for a, b in zip(names_a, names_b)]

The output:

In [7]: similar
Out[7]: [0.5, 0.0, 0.0]
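
Since the question is about a data frame, here is a minimal sketch of the same idea applied to pandas columns. The column names name_a and name_b are placeholders (the question does not give the real ones), and the result is written to a new similarity column.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical column names; replace them with the ones in your data frame
df = pd.DataFrame({
    "name_a": ["raju", "ramana", "seshu"],
    "name_b": ["vasu", "seshu", "ramana"],
})

# One score per row: compare the two strings found in the same row
df["similarity"] = [
    SequenceMatcher(None, a, b).ratio()
    for a, b in zip(df["name_a"], df["name_b"])
]
```

Zipping the two columns avoids the overhead of a row-wise df.apply(..., axis=1), although each comparison is still done in pure Python.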

score 0 · Answer 2 · answered Jun 21 '19 at 12:06

The fuzzywuzzy module can be used for string matching.

For example:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
    97
>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

For more details, see https://pypi.org/project/fuzzywuzzy/

score 0 · Answer 3 · answered Jun 21 '19 at 12:10

fuzzywuzzy does what you want, but it is very slow if your dataset has a lot of rows.

I would use a vectorizer from sklearn (e.g. TfidfVectorizer) to transform the strings into vectors, then pass them to cosine_similarity (also from sklearn).
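
A rough sketch of that approach, under a few assumptions of mine: the vectorizer uses character n-grams (analyzer="char", ngram_range=(1, 2)), because each name is a single short word and word-level tokens would rarely overlap, and the two columns are plain Python lists. Since TfidfVectorizer L2-normalises each row by default, the row-wise dot product of the two matrices is the cosine similarity of each pair.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

names_a = ["raju", "ramana", "seshu"]
names_b = ["vasu", "seshu", "ramana"]

# Assumption: character 1- and 2-grams; tune analyzer/ngram_range for your data
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))

# Fit on both columns so they share one vocabulary
vectorizer.fit(names_a + names_b)
vec_a = vectorizer.transform(names_a)
vec_b = vectorizer.transform(names_b)

# Rows are unit length (norm="l2" is the default), so the element-wise
# product summed over each row is the cosine similarity of that pair
row_similarity = np.asarray(vec_a.multiply(vec_b).sum(axis=1)).ravel()
```

Calling cosine_similarity(vec_a, vec_b) would give the same values on its diagonal, but it computes every pair of rows, which gets expensive for large data frames when only matching rows are needed.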

AdForte
