-1

I am having trouble writing cosine similarity scores from an array back to a pandas DataFrame. I tested the cosine similarity matrix with the code below.

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

Below is the output produced by the code:

[image: output of the code above]

However, I want to write the similarity scores back to a DataFrame with a structure like the one below:

[image: desired DataFrame structure]

Dummy data code to replicate the example

df1 = pd.DataFrame(columns=['Query','Corpus'])
df1['Query'] = ["A man is eating pasta","A man is eating pasta","A man is eating pasta","A man is eating pasta","A man is eating pasta"]
df1['Corpus'] = ["A man is eating food","A man is eating a piece of bread.","A man is riding a horse","A man is riding a white horse on an enclosed ground","A cheetah is running behind its prey"]

df1

**A detailed example can be found here: https://www.codegrepper.com/code-examples/python/sentence+transformers **

I did look at similar questions, "Cosine Similarity for Sentences in Dataframe" and "Cosine similarity of rows in pandas DataFrame", but they don't answer my question. Any pointers would be helpful.

EricA
  • If your code picks the top 5, why do you want 9 results for each query? Do you want the cosine similarity for all the options? – INGl0R1AM0R1 Aug 03 '22 at 16:29
  • It picks the top 5, but if a corpus sentence is not in the top 5 then it should be assigned 0, or it is fine for it to be empty. Alternatively, the DataFrame could leave out the rows that are not in the top 5. – EricA Aug 03 '22 at 16:33
  • Edited to return top 5 only in the dataframe – EricA Aug 03 '22 at 16:35
  • Thanks Peter, noted. I will take care of this in my future posts. Many thanks for the pointers. – EricA Aug 03 '22 at 17:59

1 Answer

0

I don't know what you tried, but I would go with a simple dictionary of lists and append one entry per result:

dc = {'Query': [], 'Corpus': [], 'Cos_sim': []}

for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)
    # Append one row per top hit; assigning with "=" would overwrite the lists
    for score, idx in zip(top_results[0], top_results[1]):
        dc['Query'].append(query)
        dc['Corpus'].append(corpus[idx])
        dc['Cos_sim'].append(score.item())  # 0-dim tensor -> Python float

After that just do

df = pd.DataFrame(dc)

That will give you the DataFrame you want.
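To try the row-building idea without downloading a model, here is a minimal sketch that swaps in random NumPy vectors for the sentence embeddings. The function name `top_k_frame`, the embedding size of 8, and the random data are all assumptions made for illustration; the cosine scores and top-k selection are done with plain NumPy as a stand-in for `util.cos_sim` and `torch.topk`.

```python
import numpy as np
import pandas as pd


def top_k_frame(queries, corpus, query_emb, corpus_emb, top_k=5):
    """Build one row per (query, top-k corpus sentence) with its cosine score."""
    # Normalise each row so that a dot product equals the cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    scores = q @ c.T  # shape: (n_queries, n_corpus)

    k = min(top_k, len(corpus))
    rows = []
    for qi, query in enumerate(queries):
        # Indices of the k highest scores for this query, best first
        best = np.argsort(-scores[qi])[:k]
        for ci in best:
            rows.append(
                {"Query": query, "Corpus": corpus[ci], "Cos_sim": float(scores[qi, ci])}
            )
    return pd.DataFrame(rows, columns=["Query", "Corpus", "Cos_sim"])


# Hypothetical stand-in embeddings; real ones would come from model.encode(...)
rng = np.random.default_rng(0)
queries = ["A man is eating pasta", "A cheetah chases prey on across a field"]
corpus = [
    "A man is eating food",
    "A man is eating a piece of bread.",
    "A man is riding a horse",
    "A woman is playing violin",
    "A man is riding a white horse on an enclosed ground",
    "A cheetah is running behind its prey",
]
query_emb = rng.normal(size=(len(queries), 8))
corpus_emb = rng.normal(size=(len(corpus), 8))

df = top_k_frame(queries, corpus, query_emb, corpus_emb, top_k=5)
print(df)
```

With real embeddings, the tensors would first need converting to NumPy arrays (for example with `.cpu().numpy()`) before being passed in. Collecting a list of row dictionaries and building the DataFrame once at the end avoids the overwrite problem and keeps the scores as plain floats rather than tensors.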

INGl0R1AM0R1
  • I tried the code above, and the logic makes sense from dc['Query'] = [query] * 5 onward, but I did not get the desired output. See below for the DataFrame the code returned. – EricA Aug 03 '22 at 18:32
  • ```df = pd.DataFrame(columns=['Query','Corpus','Cos_sim']) df['Query'] = ["A man is eating pasta","Someone in a gorilla costume is playing a set of drums","A cheetah chases prey on across a field"] df['Corpus'] = ["A woman is playing violin","A woman is playing violin","A woman is playing violin"] df['Cos_sim'] = ["tensor(0.0762)","tensor(0.0762)","tensor(0.0762)"] ``` – EricA Aug 03 '22 at 18:32