I have a data frame which has two columns. The content column has about 8000 rows of sentences. The embeddings column has the embedding for each sentence from the content column.

I want to get the cosine similarity score for each pair of sentences.

I used cosine_similarity(df['embeddings'][0], df['embeddings'][1:]). However, it only gives me the cosine similarity between sentence 0 and each of the remaining sentences.

What I want is a dataframe like:

[image: example of the desired output]

Any hints will be super helpful. Thank you!

LikeCoding
  • Might be a good idea to tag your question with Pandas or Python just to get some more attention on it – luke Jul 05 '22 at 15:59

1 Answer

What you need is the cosine similarity of every combination of 2 sentences in the data frame.

This can be done using the combinations function from the itertools module.
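As a quick illustration of what combinations produces (the three-item list below is made up for the demonstration), each item it yields is a tuple of two elements, and math.comb gives the number of pairs to expect:

```python
from itertools import combinations
from math import comb

# Hypothetical stand-in for the 'content' column.
sentences = ["a", "b", "c"]

# Each pair is a tuple; every unordered pair appears exactly once.
pairs = list(combinations(sentences, 2))
print(pairs)  # [('a', 'b'), ('a', 'c'), ('b', 'c')]

# Number of pairs for the roughly 8000 sentences in the question.
n_pairs = comb(8000, 2)
print(n_pairs)  # 31996000
```

With about 32 million pairs, building the result one row at a time will likely be slow, so it may be worth trying the loop on a small slice of the data first.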

Ex:

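import numpy as np
import pandas as pd
from itertools import combinations
# Assumption: cosine_similarity in the question is scikit-learn's
from sklearn.metrics.pairwise import cosine_similarity

rows = []
# Iterate over pairs of row labels, not pairs of column names
for i, j in combinations(df.index, 2):
   # cosine_similarity expects 2D inputs, so reshape each embedding to (1, dim)
   e0 = np.asarray(df['embeddings'][i]).reshape(1, -1)
   e1 = np.asarray(df['embeddings'][j]).reshape(1, -1)
   rows.append([df['content'][i], df['content'][j], cosine_similarity(e0, e1)[0, 0]])

# Build the result once at the end instead of appending row by row
sentenceCombs = pd.DataFrame(rows, columns = ['Sentence0', 'Sentence1', 'CosineSim'])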

This code is untested, but with some modification it should work well.
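If the pairwise loop turns out to be too slow for about 8000 sentences, one possible alternative is to compute the whole similarity matrix at once and then keep only the entries above the diagonal. The sketch below is an assumption-laden illustration, not the method from the answer: the tiny df is made-up stand-in data, and the cosine similarity is computed with plain NumPy (normalise each row, then take dot products) instead of calling cosine_similarity.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's dataframe: one embedding per row.
df = pd.DataFrame({
    "content": ["s0", "s1", "s2"],
    "embeddings": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
})

# Stack the per-row embeddings into one (n_sentences, dim) matrix.
X = np.array(df["embeddings"].tolist(), dtype=float)

# Normalise each row to unit length; dot products are then cosine similarities.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = X_unit @ X_unit.T

# Strict upper triangle: each unordered pair (i, j) with i < j appears once.
i, j = np.triu_indices(len(df), k=1)

content = df["content"].to_numpy()
sentenceCombs = pd.DataFrame({
    "Sentence0": content[i],
    "Sentence1": content[j],
    "CosineSim": sim[i, j],
})
print(sentenceCombs)
```

For 8000 sentences the full matrix has 8000 × 8000 entries (roughly 0.5 GB as 64-bit floats) and the resulting dataframe has about 32 million rows, so memory is the thing to watch with this approach.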

luke
  • Hey luke, thank you so much for your feedback. For the line df.apply(lambda x: x.append('SUPERSUPER'),1), it shows "cannot concatenate object of type ''; only Series and DataFrame objs are valid" – LikeCoding Jul 05 '22 at 16:18
  • edited, try that. The code is still untested, it may be easier for you to read the summary of what it's intended to do and use my code as a starting point. – luke Jul 05 '22 at 16:26
  • Also, @KaylaWentingJiang, see https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat?rq=1 – luke Jul 05 '22 at 16:27
  • Thank you so much, luke! I used this line and it worked: df['content_delim'] = df['content'].astype(str) + delim. I am having another error when I was trying to split the sentence: s0 = sentences[0] s1 = sentences[1] sentenceCombs.loc[idx] = [s0, s1, cosine_similarity(s0, s1)]. It said the comb which is a tuple "'tuple' object has no attribute 'split'". – LikeCoding Jul 05 '22 at 17:09
  • My bad! I thought that the combinations module returned strings, not tuples, so that makes this problem much, much easier! I'll edit the answer – luke Jul 05 '22 at 18:33
  • I really appreciate your guidance, luke! It worked very well. :-) – LikeCoding Jul 07 '22 at 02:16
  • Glad I could help @LikeCoding :). If the answer is finished, it's a good idea to set it as the accepted answer and upvote. – luke Jul 07 '22 at 13:30