Cosine similarity between each row in a Dataframe in Python

Question

I have a DataFrame containing multiple vectors, each with 3 entries. Each row is a vector in my representation. I need to calculate the cosine similarity between each pair of these vectors. Is it better to convert the DataFrame to a matrix representation, or is there a cleaner approach within the DataFrame itself?

Here is the code that I have tried.

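import pandas as pd
from scipy import spatial

# X, Y, Z are the input sequences (not shown in the question); each row of df is one vector
df = pd.DataFrame([X, Y, Z]).T
similarities = df.values.tolist()

for x in similarities:
    for y in similarities:
        # cosine similarity = 1 - cosine distance; result is overwritten on every iteration
        result = 1 - spatial.distance.cosine(x, y)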

Please share what you have tried so far so that we may properly help you. — Naeem Ul Wahhab, Jul 29 '17 at 09:13
@JayanthPrakashKulkarni: the for loops you are using also calculate the similarity of each row with itself. You don't need a nested loop either. Iterate over the number of rows minus 1 and calculate the cosine similarity between `df.iloc[i,:]` and `df.iloc[i+1,:]`. Alternatively, you can look into the `apply` method of DataFrames. — Clock Slave, Jul 29 '17 at 15:05
@ClockSlave Thank you for your valuable input. I'll surely try using the apply method of DataFrame. — Jayanth Prakash Kulkarni, Jul 29 '17 at 18:22
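
A minimal sketch of a loop-based variant that visits each pair of rows exactly once, so there are no self-comparisons and no mirrored duplicates. It uses `itertools.combinations` rather than the adjacent-row iteration suggested in the comment above. The data and the row labels `a`, `b`, `c` are hypothetical, since the question's X, Y and Z are not shown.

```python
from itertools import combinations

import pandas as pd
from scipy.spatial import distance

# Hypothetical example data: three 3-entry vectors, one per row.
df = pd.DataFrame(
    [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]],
    index=["a", "b", "c"],
)

# Visit each unordered pair of row labels once.
# scipy returns the cosine distance, so similarity = 1 - distance.
pair_similarity = {
    (i, j): 1.0 - distance.cosine(df.loc[i].to_numpy(), df.loc[j].to_numpy())
    for i, j in combinations(df.index, 2)
}
```

This keeps the results in a dictionary keyed by row-label pairs instead of overwriting a single variable. For more than a handful of rows, a vectorized routine is likely to be much faster than a Python loop.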

miradulo · Accepted Answer · 2018-03-20T20:18:44.300

You can use sklearn.metrics.pairwise.cosine_similarity directly on the DataFrame.

Demo

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(np.random.randint(0, 2, (3, 5)))

df
##     0  1  2  3  4
##  0  1  1  1  0  0
##  1  0  0  1  1  1
##  2  0  1  0  1  0

cosine_similarity(df)
##  array([[ 1.        ,  0.33333333,  0.40824829],
##         [ 0.33333333,  1.        ,  0.40824829],
##         [ 0.40824829,  0.40824829,  1.        ]])
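
Since `cosine_similarity` returns a plain NumPy array, the row labels are lost. The sketch below, which reuses the demo values with hypothetical labels `r0`, `r1`, `r2`, wraps the result back into a DataFrame. It also shows one way to get the same numbers with NumPy alone, by scaling each row to unit length and taking dot products. That NumPy version is an illustration, not necessarily how scikit-learn computes it internally.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Same values as the demo above; the index labels are hypothetical.
df = pd.DataFrame(
    [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1], [0, 1, 0, 1, 0]],
    index=["r0", "r1", "r2"],
)

# Keep the row labels on both axes of the similarity matrix.
sim = pd.DataFrame(cosine_similarity(df), index=df.index, columns=df.index)

# NumPy-only equivalent: normalize each row to unit length, then take
# all pairwise dot products. Assumes no row is all zeros.
values = df.to_numpy(dtype=float)
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
sim_numpy = unit @ unit.T
```

With the labels in place, a single pair can be looked up as `sim.loc["r0", "r1"]`.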

score 2 · Answer 2 · edited May 01 '23 at 16:11

You can import pairwise_distances from sklearn.metrics.pairwise and pass it the DataFrame, together with metric='cosine', because the metric parameter defaults to 'euclidean'. Note that this returns the cosine distance, which is 1 minus the cosine similarity, so subtract the result from 1 to get similarities.

Demo

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

df = pd.DataFrame(np.random.randint(0, 5, (3, 5)))

df
##      0   1   2   3   4
## 0    4   2   1   3   2
## 1    3   2   0   0   1
## 2    3   3   4   2   4

pairwise_distances(df, metric='cosine')
## array([[2.22044605e-16, 1.74971353e-01, 1.59831950e-01],
##        [1.74971353e-01, 0.00000000e+00, 3.08976681e-01],
##        [1.59831950e-01, 3.08976681e-01, 0.00000000e+00]])
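
The matrix above holds cosine distances, which is why the diagonal is (numerically) zero rather than one. A short sketch of the conversion to similarities, reusing the demo values:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Same values as the demo above.
df = pd.DataFrame([[4, 2, 1, 3, 2], [3, 2, 0, 0, 1], [3, 3, 4, 2, 4]])

# Cosine distance is defined as 1 - cosine similarity.
dist = pairwise_distances(df, metric="cosine")
sim = 1.0 - dist

# Should agree with cosine_similarity up to floating-point error.
matches = np.allclose(sim, cosine_similarity(df))
```

If only similarities are needed, calling `cosine_similarity` directly avoids the extra subtraction.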
