-1

I would like to find the column of a data frame with the maximum value per row and if there are multiple equally large values, then return all the column names where those values are. I would like to store all of these values in the last column of the data frame. I have been referencing the following post, and am unsure of how to modify it to handle data frames:

Using Python's max to return two equally large values

So if my data looked like this

Key    Column_1  Column_2  Column_3
0          1        2         3
1          1        1         0
2          0        0         0

My goal is an output that looks like this:

Key    Column_1  Column_2  Column_3  Column_4
0          1        2         3      Column_3
1          1        1         0      Column_1,Column_2
2          0        0         0      NA

I know how to use idxmax(axis=1,skipna = True) to return the first max and know that if I change 0 to Nan in the dataframe it will populate the last row correctly, just not sure how to do this when there are multiple max values.

Any help is greatly appreciated ! I am an R programmer and this is my first time in Python.

CJJ
  • 75
  • 9

1 Answers1

2

Using dot as well, and combining it with mask:

d = df.set_index('Key').select_dtypes('number')
v = d.eq(d.max(axis=1), axis=0).dot(d.columns + ',').str.rstrip(',')
df['Column_4'] = v.mask(d.eq(0).all(axis=1)))
   Key  Column_1  Column_2  Column_3           Column_4
0    0         1         2         3           Column_3
1    1         1         1         0  Column_1,Column_2
2    2         0         0         0                NaN
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • This works perfectly, thank you ! My data frame will end up having both numeric and non numeric columns. I obviously only want the max on non numeric columns. Would this script account for that. I can always subset my data frame beforehand and then join back in. Or is there a way to subset here, say by column index? – CJJ Oct 06 '20 at 16:37
  • 1
    added `select_dtypes` so you only grab the numeric columns – Erfan Oct 06 '20 at 16:39
  • Awesome ! I am getting this error: TypeError: data type "numeric" not understood. When I do df.dtypes, my number columns are float64. Should I replace 'numeric' in your script with 'float64'? – CJJ Oct 06 '20 at 16:49
  • 1
    It's `number` instead of `numeric`. This should work – Erfan Oct 06 '20 at 16:50
  • Got it that worked thank you ! I was not familiar with eq(), mask() or dot() but they seem to be very helpful. I will look into them more. – CJJ Oct 06 '20 at 17:04
  • I am now trying to modify this to return all columns where the row value is not null (nan). If my understanding of this script is correct, would replacing the second line with v = d.eq(d.notnull(), axis=0).dot(d.columns + ',').str.rstrip(',') suffice ? – CJJ Oct 08 '20 at 14:00
  • 1
    `d.notna().dot(d.columns + ',').str.rstrip(',')` Should work – Erfan Oct 08 '20 at 14:02
  • Worked perfectly, thanks ! I work exclusively in R so Python is totally new to me. Appreciate the help! – CJJ Oct 08 '20 at 14:32