How to calculate the similarity measure of text document?

Question

I have a CSV file that looks like:

idx         messages
112  I have a car and it is blue
114  I have a bike and it is red
115  I don't have any car
117  I don't have any bike

I would like code that reads the file and computes the pairwise similarity between the messages.

I have looked at many posts on this topic (such as 1, 2, 3, 4), but they are either hard for me to understand or not exactly what I want.

Some posts and web pages say that "a simple and effective one is Cosine similarity", while others suggest "Universal sentence encoder" or "Levenshtein distance".

It would be great if you could help with code that I can run on my side as well. Thanks.

https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/ — Robbie Milejczak, Jun 08 '19 at 20:14

ALollz · Answer 1 · 2019-06-08T23:55:35.957

I don't think calculations like this can be vectorized particularly well, so a simple loop is fine. At a minimum, use the fact that the calculation is symmetric and the diagonal is always 100 to cut down on the number of comparisons you perform.

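import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

# df is assumed to be already loaded (e.g. with pd.read_csv) and to have
# the columns 'idx' and 'messages' shown in the question.
K = len(df)
similarity = np.empty((K, K), dtype=float)

for i, ac in enumerate(df['messages']):
    for j, bc in enumerate(df['messages']):
        if i > j:
            continue  # symmetric: the mirrored cell was already filled
        if i == j:
            sim = 100  # a string is always identical to itself
        else:
            sim = fuzz.ratio(ac, bc)  # Use whatever metric you want here
                                      # for comparison of 2 strings.

        similarity[i, j] = sim
        similarity[j, i] = sim

# Label rows and columns with the message ids.
df_sim = pd.DataFrame(similarity, index=df.idx, columns=df.idx)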

Output: `df_sim`

idx    112    114    115    117
idx
112  100.0   78.0   51.0   50.0
114   78.0  100.0   47.0   54.0
115   51.0   47.0  100.0   83.0
117   50.0   54.0   83.0  100.0
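
For anyone who wants to try the same idea without installing pandas or fuzzywuzzy, the loop above can be sketched with the standard library only. This is a minimal sketch under a few assumptions: the sample rows are embedded as a comma-separated string (the real file layout may differ), `difflib.SequenceMatcher` stands in for `fuzz.ratio`, and `similarity_matrix` is a hypothetical helper name. Scores from this stand-in metric will not necessarily match the table above.

```python
import csv
import io
from difflib import SequenceMatcher
from itertools import combinations

# Assumed CSV layout; for a real file, pass open("your_file.csv", newline="")
# to csv.DictReader instead of io.StringIO(SAMPLE).
SAMPLE = """idx,messages
112,I have a car and it is blue
114,I have a bike and it is red
115,I don't have any car
117,I don't have any bike
"""


def similarity_matrix(rows):
    """Return {(idx_a, idx_b): score in 0..100} for every ordered pair of rows."""
    scores = {}
    # The diagonal is always 100: each message is identical to itself.
    for idx, _ in rows:
        scores[(idx, idx)] = 100.0
    # combinations() visits each unordered pair once. The score is computed
    # in one direction only and then mirrored, so the table is symmetric.
    for (ia, a), (ib, b) in combinations(rows, 2):
        sim = round(100 * SequenceMatcher(None, a, b).ratio(), 1)
        scores[(ia, ib)] = sim
        scores[(ib, ia)] = sim
    return scores


rows = [(r["idx"], r["messages"]) for r in csv.DictReader(io.StringIO(SAMPLE))]
scores = similarity_matrix(rows)

# Print the matrix as a small table.
ids = [idx for idx, _ in rows]
print("idx  " + "".join(f"{i:>7}" for i in ids))
for a in ids:
    print(f"{a:<5}" + "".join(f"{scores[(a, b)]:>7.1f}" for b in ids))
```

Swapping in another metric only requires changing the one line that computes `sim`.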

edited Jun 08 '19 at 23:55

answered Jun 08 '19 at 23:09

ALollz


Thanks for your comment. Yes, the calculation is symmetric. As I mentioned, the calculation is based on checking each string against the others and giving the ratio of similarity between their words, like the `78.0` you wrote. I tried to run your code and got this error: `AttributeError: 'DataFrame' object has no attribute 'messages'`. I am pretty new to Python and I am not sure why I am getting the error. Can you please tell me what I am missing? – Bilgin Jun 08 '19 at 23:34
@Bilgin that is the name of the column, that contains all of the strings. In the example you provided it appeared to be `'messages'`, but in your real data, that must not be the case. Use whatever the column name is (same goes for where I use `df.idx`) – ALollz Jun 08 '19 at 23:55
Thanks, I fixed the issue. So, as I understand it, this is a fuzzy ratio similarity check? How can I perform cosine similarity here? Thanks for the help. – Bilgin Jun 09 '19 at 00:26
@Bilgin See [this post](https://stackoverflow.com/questions/15173225/calculate-cosine-similarity-given-2-sentence-strings). Using the accepted solution (copy those defs into your program), you'd need something like: `vector1 = text_to_vector(ac)`, `vector2 = text_to_vector(bc)` then `sim = get_cosine(vector1, vector2)` – ALollz Jun 09 '19 at 17:49
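
To make the last comment concrete, here is a minimal sketch of bag-of-words cosine similarity using only the standard library. The names `text_to_vector` and `get_cosine` follow the comment above; the bodies are written here for illustration and are not copied from the linked post. Multiplying by 100 to match the 0 to 100 scale of `fuzz.ratio` is an assumption, and the tokenizer is case-sensitive with no stemming or stop-word handling.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Bag-of-words vector: word -> number of occurrences."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two Counter vectors, in the range 0..1."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text has no direction to compare
    return numerator / (norm1 * norm2)


ac = "I have a car and it is blue"
bc = "I have a bike and it is red"

# Scaled to 0..100 (an assumption) so it can replace fuzz.ratio in the loop.
sim = 100 * get_cosine(text_to_vector(ac), text_to_vector(bc))
print(round(sim, 1))  # 6 shared words out of 8 in each sentence -> 75.0
```

In the answer's loop, the `sim = fuzz.ratio(ac, bc)` line would be replaced by the `sim = 100 * get_cosine(...)` expression.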
