I tried asking this question with my original data and code, but I realised it might have been too much to read, so I have created some toy data to make the question simpler. Here is the code with the toy data, which should be easy to copy/paste to reproduce:
import pandas as pd
df = pd.DataFrame(
    [['A boy ran.', [1, 2], 1, [5, 7], 0.997],
     ['A good pet.', [7, 9], 0, [3, 2], 0.977],
     ['The car is fast.', [7, 5], 1, [1, 9], 0.962],
     ['The girl sang.', [0, 5], 2, [4, 1], 0.992]],
    columns=['sentences', 'embeddings', 'labels', 'cluster_centres', 'cosine_scores'])
print(df)
new_df = df.groupby(['labels']).max()
print(new_df)
The initial dataframe (df) has 5 columns that mimic my original data (except that the values are much simpler): sentences contains one sentence in each row, embeddings and cluster_centres each contain a numeric array in each row, labels contains a value of 0, 1 or 2, and cosine_scores contains a float in each row.
I would like to group the rows by the values in the labels column (so the 0s, 1s and 2s are together) and then, for each label, get the sentence from the sentences column in the row that has the max value in the cosine_scores column. To clarify, in the above example there are two rows with a value of 1 in the labels column. The first row (index=0) has a higher cosine score than the other row (index=2) (specifically: 0.997 > 0.962). Thus, for the label 1, I would want the sentence from index=0 ('A boy ran.'). However, when I run the above code, I get the following dataframe for new_df:
               sentences embeddings cluster_centres  cosine_scores
labels
0            A good pet.     [7, 9]          [3, 2]          0.977
1       The car is fast.     [7, 5]          [5, 7]          0.997
2         The girl sang.     [0, 5]          [4, 1]          0.992
As you can see, it is choosing the correct max value of cosine_scores for labels=1 (0.997 from row index=0); however, in the sentences column it is choosing the wrong sentence (it should be 'A boy ran.' and not 'The car is fast.'). Based on my analysis of my actual data, this is because it is choosing the sentence that starts with the "max" letter of the alphabet (i.e. the letter that comes later alphabetically; in this case T comes after A, so the other sentence was chosen). So my question is: how do I take the max of ONLY cosine_scores and return the other columns from the same row as that max value, for each label in labels? Thanks for any help!!
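For reference, one possible way to get the behaviour described above is sketched here. It is only a sketch, not a definitive answer: it assumes the dataframe index is unique and that cosine_scores contains no missing values, and the names best_idx and best_rows are made up for illustration. The idea is to look up the index of the max cosine_scores row per label with idxmax, then select those whole rows with loc so that every column comes from the same original row.

```python
import pandas as pd

df = pd.DataFrame(
    [['A boy ran.', [1, 2], 1, [5, 7], 0.997],
     ['A good pet.', [7, 9], 0, [3, 2], 0.977],
     ['The car is fast.', [7, 5], 1, [1, 9], 0.962],
     ['The girl sang.', [0, 5], 2, [4, 1], 0.992]],
    columns=['sentences', 'embeddings', 'labels', 'cluster_centres', 'cosine_scores'])

# For each label, find the index of the row with the highest cosine score
# (best_idx is a hypothetical name; assumes a unique index and no NaN scores)
best_idx = df.groupby('labels')['cosine_scores'].idxmax()

# Select those whole rows, so all columns come from the same original row
best_rows = df.loc[best_idx].set_index('labels')
print(best_rows[['sentences', 'cosine_scores']])
```

With the toy data, the row for label 1 should then hold 'A boy ran.' together with its own embeddings and cluster_centres, rather than a mix of per-column maxima.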