I have this dataframe called quest:

    0_score     1_score     2_score     3_score     4_score     5_score     true_label
0   0.007512    0.264500    0.273147    0.218029    0.233726    0.003084    1
1   0.130695    0.289085    0.173402    0.144897    0.238129    0.023792    1
2   0.006896    0.130070    0.289822    0.210133    0.219567    0.143512    4
3   0.006819    0.178320    0.259109    0.041048    0.316587    0.198118    1
4   0.011121    0.058437    0.182823    0.317847    0.123521    0.306250    3

I want to create a new column, based on the value in column true_label. I can do this:

scores = ['0_score', '1_score', '2_score', '3_score', '4_score','5_score']
(quest.assign(true_label_score = lambda df_:df_[scores[1]]))

Which gives me this:


    0_score     1_score     2_score     3_score     4_score     5_score     true_label  true_label_score
0   0.007512    0.264500    0.273147    0.218029    0.233726    0.003084    1   0.264500
1   0.130695    0.289085    0.173402    0.144897    0.238129    0.023792    1   0.289085
2   0.006896    0.130070    0.289822    0.210133    0.219567    0.143512    4   0.130070
3   0.006819    0.178320    0.259109    0.041048    0.316587    0.198118    1   0.178320
4   0.011121    0.058437    0.182823    0.317847    0.123521    0.306250    3   0.058437

How do I replace `df_[scores[1]]` with something like `scores[quest.true_label]`, so that each row uses its value in the true_label column to pick the matching column from the list scores, and the value in true_label_score comes from that column? Index row 2 should take its value from the 4_score column and index row 4 should take its value from the 3_score column.

higrm
  • Use numpy indexing: `df['true_label_score'] = df.to_numpy()[np.arange(len(df)), df['true_label']]` – mozway Mar 29 '22 at 01:36
  • @Rodalm, I'll give your numpy solution a try, though with only 1000 rows, your first solution works lightning fast as well. My comment about the column order is that I wanted a general solution, where my list of columns "scores" would not necessarily contain the first 6 columns of the dataframe, but 6 columns of a 50 column dataframe where they could have other columns intervening. I want to use the value in the column true_label to get the column name from the source list and then use that column name to grab the correct row value. – higrm Mar 29 '22 at 19:27
  • The true_label column contains the index for the list of columns in the list scores. Example, I could have a list scores = ["column 8", "column 14", "column 15", "column 19", "column 42"] . If the value in column true_label = 2, then for this new column, I need the value from "column 15", as this is the index 2 position in the list scores. – higrm Mar 29 '22 at 19:27

1 Answer

You can use `DataFrame.apply`:

def label_score(row):
    col_num = int(row['true_label'])
    return row[f'{col_num}_score']

quest['true_label_score'] = quest.apply(label_score, axis=1)

If you want a solution based on the scores list you can do

scores = ['0_score', '1_score', '2_score', '3_score', '4_score','5_score']

def label_score(row, scores):
    col_num = int(row['true_label'])
    col_label = scores[col_num]
    return row[col_label]

quest['true_label_score'] = quest.apply(label_score, scores=scores, axis=1)

However, assuming that the columns are in the right order (i.e. 0_score is the first column, 1_score is the second, etc.), a faster option is numpy fancy indexing, as @mozway suggested:

import numpy as np

quest['true_label_score'] = quest.to_numpy()[np.arange(len(quest)), quest['true_label']]

Output:

>>> quest 

    0_score   1_score   2_score   3_score   4_score   5_score  true_label  true_label_score
0  0.007512  0.264500  0.273147  0.218029  0.233726  0.003084           1          0.264500
1  0.130695  0.289085  0.173402  0.144897  0.238129  0.023792           1          0.289085
2  0.006896  0.130070  0.289822  0.210133  0.219567  0.143512           4          0.219567
3  0.006819  0.178320  0.259109  0.041048  0.316587  0.198118           1          0.178320
4  0.011121  0.058437  0.182823  0.317847  0.123521  0.306250           3          0.317847
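If the score columns are not the leading columns of the DataFrame, the numpy approach can probably be kept by selecting the columns in `scores` first, as the asker reports in the comments. The following is a minimal sketch on the sample data from the question; the helper names `score_values`, `row_idx` and `col_idx` are only illustrative, and it assumes every value in true_label is a valid integer position in `scores`.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
quest = pd.DataFrame({
    "0_score": [0.007512, 0.130695, 0.006896, 0.006819, 0.011121],
    "1_score": [0.264500, 0.289085, 0.130070, 0.178320, 0.058437],
    "2_score": [0.273147, 0.173402, 0.289822, 0.259109, 0.182823],
    "3_score": [0.218029, 0.144897, 0.210133, 0.041048, 0.317847],
    "4_score": [0.233726, 0.238129, 0.219567, 0.316587, 0.123521],
    "5_score": [0.003084, 0.023792, 0.143512, 0.198118, 0.306250],
    "true_label": [1, 1, 4, 1, 3],
})

scores = ["0_score", "1_score", "2_score", "3_score", "4_score", "5_score"]

# Select only the listed columns, in list order, so that array column k
# corresponds to scores[k] regardless of the DataFrame's own column layout.
score_values = quest[scores].to_numpy()

# One row position and one column position per row (illustrative names).
row_idx = np.arange(len(quest))
col_idx = quest["true_label"].to_numpy()

# Pairwise fancy indexing: picks score_values[i, col_idx[i]] for each row i.
quest["true_label_score"] = score_values[row_idx, col_idx]
```

Because the lookup goes through `quest[scores]`, other columns may sit between or around the score columns without affecting the result.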
Rodalm
  • I will give it a try. Looks good. Thank you. – higrm Mar 29 '22 at 01:31
  • @higrm No problem, I'm glad to help. Does it solve your issue? – Rodalm Mar 29 '22 at 01:48
  • it works, but only for the first 5 rows. Starting from row 6, I get NaN in the true_label_score column. Any suggestions? I have 1086 rows in total. – higrm Mar 29 '22 at 02:43
  • @higrm It's hard to guess since you just provided the first 5 rows of data. If you elaborate a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) with more data I can try to see what is the problem. – Rodalm Mar 29 '22 at 02:46
  • my bad. I mixed my stack overflow dataframe quest with my real dataframe, so the apply was only done on the first 5 rows from the quest dataframe. Your solution works great. I see that apparently I should have done some numpy reindexing, but so far, I cannot figure out how that would work, when my column content and column headings are not identical. Thanks for your help. – higrm Mar 29 '22 at 03:04
  • @higrm I added another possible solution based on numpy indexing, it should be faster. Does it work for you? What do you mean by "when my column content and column headings are not identical." ? – Rodalm Mar 29 '22 at 15:47
  • I'll give your numpy solution a try, though with only 1000 rows, your first solution works lightning fast as well. My comment about the column order is that I wanted a general solution, where my list of columns scores would not necessarily be the first 6 columns of the dataframe, but 6 columns of a 50 column dataframe where they could have other columns intervening. I want to use the value in the column true_label to get the column name from the source list and then use that column name to grab the correct row value. See my comment above for more details. – higrm Mar 29 '22 at 19:34
  • this works: `quest['true_label_score'] = quest[scores].to_numpy()[np.arange(len(quest)), quest['true_label']]` – higrm Mar 29 '22 at 20:04
  • @higrm In that case use one of the first two solutions, which don't rely on the order of the DataFrame columns. Yeah, `apply` can easily handle 1000 rows ;) If you find my answer useful please consider [marking it as accepted](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). – Rodalm Mar 29 '22 at 22:52
  • Yes, your answer was very useful. Accepted! – higrm Mar 30 '22 at 00:25