Cosine Similarity for Sentences in Dataframe

Question

I have a data frame which has two columns. The content column has about 8000 rows of sentences. The embeddings column has the embedding for each sentence from the content column.

I want to get the cosine similarity score for each pair of sentences.

I used cosine_similarity(df['embeddings'][0], df['embeddings'][1:]). However, it only gives me the cosine similarities between sentence 0 and the rest of the sentences.

What I want is a dataframe like:

Any hints will be super helpful. Thank you!

Might be a good idea to tag your question with Pandas or Python just to get some more attention on it — luke, Jul 05 '22 at 15:59

luke · Answer 1 · 2022-07-05T18:35:18.370


What you need is the cosine similarity of every combination of 2 sentences in the data frame.

This can be done using the combinations function from the itertools module.
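As a quick illustration of what combinations produces, here is a minimal sketch on a made-up list (the sentence strings are placeholders, not taken from the question). Each element it yields is a tuple of two items, and every unordered pair appears exactly once.

```python
from itertools import combinations

# Hypothetical placeholder sentences, used only to show the shape of the output.
sentences = ["first", "second", "third"]

# combinations(iterable, 2) yields each unordered pair once, as a tuple.
pairs = list(combinations(sentences, 2))

for s0, s1 in pairs:
    print(s0, "|", s1)
```

For n items this gives n * (n - 1) / 2 pairs, so about 32 million pairs for 8000 sentences.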

Ex:

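import pandas as pd
from itertools import combinations
# Assumption: the question does not say which cosine_similarity is used; scikit-learn's is assumed here.
from sklearn.metrics.pairwise import cosine_similarity

rows = []
# Pair up row positions, not df.columns (which would only pair the two column names).
for i, j in combinations(range(len(df)), 2):
    e0, e1 = df['embeddings'].iloc[i], df['embeddings'].iloc[j]
    # scikit-learn expects 2-D inputs, so wrap each embedding in a list; [0, 0] extracts the scalar.
    sim = cosine_similarity([e0], [e1])[0, 0]
    rows.append([df['content'].iloc[i], df['content'].iloc[j], sim])
# Build the result once at the end instead of assigning to .loc row by row.
sentenceCombs = pd.DataFrame(rows, columns=['Sentence0', 'Sentence1', 'CosineSim'])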

This code is untested, but with some modification it should work well.
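With about 8000 sentences, a Python-level loop over every pair can be slow. A possible alternative is sketched below, under some assumptions: each entry of the embeddings column is a numeric vector of the same length, no embedding is all zeros, and cosine similarity is computed directly with NumPy (normalised dot products) rather than through the cosine_similarity function used in the question. The helper name pairwise_similarity_frame and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def pairwise_similarity_frame(df, text_col="content", emb_col="embeddings"):
    """Return one row per unordered sentence pair with its cosine similarity."""
    # Stack the per-row embeddings into one (n_sentences, dim) array.
    emb = np.vstack(df[emb_col].to_list()).astype(float)
    # Normalise each row to unit length (assumes no all-zero embedding).
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Dot products of unit vectors are cosine similarities: an n x n matrix.
    sims = unit @ unit.T
    # Strict upper triangle: every pair (i, j) with i < j, exactly once.
    i, j = np.triu_indices(len(df), k=1)
    texts = df[text_col].to_numpy()
    return pd.DataFrame(
        {"Sentence0": texts[i], "Sentence1": texts[j], "CosineSim": sims[i, j]}
    )


# Tiny made-up example (hypothetical data, not from the question).
toy = pd.DataFrame(
    {
        "content": ["a", "b", "c"],
        "embeddings": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    }
)
pairs_df = pairwise_similarity_frame(toy)
print(pairs_df)
```

Note that the full similarity matrix for 8000 sentences holds 64 million floats, and the resulting frame has roughly 32 million rows, so memory may become the limiting factor; processing the rows in blocks would be one way around that.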

edited Jul 05 '22 at 18:35

answered Jul 05 '22 at 15:58

luke


Hey luke, thank you so much for your feedback. For the line df.apply(lambda x: x.append('SUPERSUPER'),1), it shows "cannot concatenate object of type ''; only Series and DataFrame objs are valid" – LikeCoding Jul 05 '22 at 16:18
edited, try that. The code is still untested, it may be easier for you to read the summary of what it's intended to do and use my code as a starting point. – luke Jul 05 '22 at 16:26
Also, @KaylaWentingJiang, see https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat?rq=1 – luke Jul 05 '22 at 16:27
Thank you so much, luke! I used this line and it worked: df['content_delim'] = df['content'].astype(str) + delim. I got another error when trying to split the sentence: s0 = sentences[0]; s1 = sentences[1]; sentenceCombs.loc[idx] = [s0, s1, cosine_similarity(s0, s1)]. It said that comb, which is a tuple, cannot be split: "'tuple' object has no attribute 'split'". – LikeCoding Jul 05 '22 at 17:09
My bad! I thought that the combinations module returned strings, not tuples, so that makes this problem much, much easier! I'll edit the answer – luke Jul 05 '22 at 18:33

I really appreciate your guidance, luke! It worked very well. :-). – LikeCoding Jul 07 '22 at 02:16
Glad I could help @LikeCoding :). If the answer is finished, it's a good idea to set it as the accepted answer and upvote. – luke Jul 07 '22 at 13:30
