I tried asking this question with my original data and code, but I realised it might have been too much to read, so I have created some toy data to make the question simpler. The code below should be easy to copy and paste to reproduce the problem:

import pandas as pd

df = pd.DataFrame([['A boy ran.', [1,2], 1, [5,7], 0.997], ['A good pet.', [7,9], 0, [3,2], 0.977], ['The car is fast.', [7,5], 1, [1,9], 0.962], ['The girl sang.', [0,5], 2, [4,1], 0.992]], columns=['sentences', 'embeddings', 'labels', 'cluster_centres', 'cosine_scores'])
print(df)

new_df = df.groupby(['labels']).max()
print(new_df)

The initial dataframe (df) has 5 columns that mimic my original data (with much simpler values): sentences contains one sentence per row, embeddings and cluster_centres each contain a numeric array per row, labels contains a value of 0, 1 or 2, and cosine_scores contains a float per row.

I would like to group the rows by the values in the labels column (so the 0s, 1s and 2s are each together) and then, for each label, get the sentence from the sentences column in the row that has the max value in the cosine_scores column. To clarify, in the above example there are two rows with a value of 1 in the labels column. The first row (row index=0) has a higher cosine_scores value than the other row (row index=2), specifically 0.997 > 0.962. Thus, for label 1, I would want the sentence from index=0 ('A boy ran.'). However, when I run the above code, I get the following dataframe for new_df:

               sentences embeddings cluster_centres  cosine_scores
labels                                                            
0            A good pet.     [7, 9]          [3, 2]          0.977
1       The car is fast.     [7, 5]          [5, 7]          0.997
2         The girl sang.     [0, 5]          [4, 1]          0.992

As you can see, it is choosing the correct max value of cosine_scores for labels=1 (0.997 from row index=0). However, in the sentences column it is choosing the wrong sentence (it should be 'A boy ran.' and not 'The car is fast.'). Based on my analysis of my actual data, this seems to be because max() is applied to each column separately, so the sentences column gets the sentence that is alphabetically last (in this case T comes after A, so the other sentence was chosen). So my question is: how do I take the max of ONLY cosine_scores and return the other columns from the same row as that max value, for each label in labels? Thanks for any help!

Nore Patel

1 Answer

Sort by labels and cosine_scores in descending order, then apply drop_duplicates on labels so that only the first (highest-scoring) row for each label is kept:

df.sort_values(['labels', 'cosine_scores'], ascending=False).drop_duplicates(['labels'])

which gives the following output:

        sentences embeddings  labels cluster_centres  cosine_scores
3  The girl sang.     [0, 5]       2          [4, 1]          0.992
0      A boy ran.     [1, 2]       1          [5, 7]          0.997
1     A good pet.     [7, 9]       0          [3, 2]          0.977
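
Another way to express the same idea, as a minimal sketch on the toy data from the question: find the index of the max cosine_scores within each label using idxmax, then select those whole rows with loc. The names best_rows and best_per_label are only illustrative, and the sketch assumes cosine_scores holds plain floats.

```python
import pandas as pd

# Toy data copied from the question.
df = pd.DataFrame(
    [
        ['A boy ran.', [1, 2], 1, [5, 7], 0.997],
        ['A good pet.', [7, 9], 0, [3, 2], 0.977],
        ['The car is fast.', [7, 5], 1, [1, 9], 0.962],
        ['The girl sang.', [0, 5], 2, [4, 1], 0.992],
    ],
    columns=['sentences', 'embeddings', 'labels', 'cluster_centres', 'cosine_scores'],
)

# Index label of the row with the highest cosine_scores within each labels group.
best_rows = df.groupby('labels')['cosine_scores'].idxmax()

# Select those whole rows, so every column comes from the same original row.
best_per_label = df.loc[best_rows]
print(best_per_label)
```

Unlike the sort-based version, this keeps the result ordered by ascending label, and the original row index is preserved in both.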
Prince Francis
  • Nevermind :( this method and all the methods from the link at the top work on the toy data but raise various errors when I try to use them on my actual data... no idea why. – Nore Patel Jan 10 '20 at 09:07
  • Nevermind again, it worked, my cosine_scores column was actually arrays in the original dataset, had to extract the floats from it, thanks :) – Nore Patel Jan 10 '20 at 09:29
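
For anyone hitting the same errors as in the comment above, here is a rough sketch of extracting the floats before sorting. The exact shape of the original data is not given, so this assumes each cosine_scores cell is a one-element NumPy array; the frame raw and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each cosine_scores cell holds a one-element array
# (an assumption; the real dataset's layout is not shown in the thread).
raw = pd.DataFrame({
    'sentences': ['A boy ran.', 'The car is fast.'],
    'labels': [1, 1],
    'cosine_scores': [np.array([0.997]), np.array([0.962])],
})

# Convert each one-element array to a plain Python float.
raw['cosine_scores'] = raw['cosine_scores'].apply(
    lambda a: float(np.asarray(a).ravel()[0])
)

# With plain floats in place, the sort + drop_duplicates approach works.
best = raw.sort_values(['labels', 'cosine_scores'], ascending=False).drop_duplicates(['labels'])
print(best)
```

If the arrays hold more than one value, the extraction step would need to pick the intended element instead of the first one.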