3

With a given pandas dataframe, I'd like to create new columns for the highest, second highest, third highest, etc... values in a row. And then create another column for the corresponding column name of each of those. The code below does this for the max value of the row, but not those that follow.

Adapted from Find the column name which has the maximum value for each row

import pandas as pd

df = pd.DataFrame({'A': (23, 24, 55, 77, 33, 66),
                   'B': (12, 33, 0.2, 44, 23.5, 66),
                   'C': (1, 33, 66, 44, 5, 62),
                   'D': (9, 343, 4, 64, 24, 63),
                   'E': (123, 33, 2.2, 42, 2, 99)})

# Determine the max value and column name and add as columns to df
df['Max1'] = df.max(axis=1)
df['Col_Max1'] = df.idxmax(axis=1)

# Determine the 2nd and 3rd max PR and threshold levels and add as columns
# ???????????

print(df)

This produces:

    A     B   C    D      E   Max1 Col_Max1
0  23  12.0   1    9  123.0  123.0        E
1  24  33.0  33  343   33.0  343.0        D
2  55   0.2  66    4    2.2   66.0        C
3  77  44.0  44   64   42.0   77.0        A
4  33  23.5   5   24    2.0   33.0        A
5  66  66.0  62   63   99.0   99.0        E

Process finished with exit code 0

Only caveat would be that it is possible to have a very large number of columns, if that matters for performance. Thanks guys.

Divakar
  • 218,885
  • 19
  • 262
  • 358
RaceFrog
  • 129
  • 1
  • 9
  • For anyone trying to understand the answer below, I've found it's called "Fancy Indexing" for numpy. Good stuff. – RaceFrog Sep 27 '17 at 01:33

1 Answers1

4

One approach using the underlying array data with focus on performance would be -

a = df.values
c = df.columns
idx = a.argsort(1)[:,::-1]
vals = a[np.arange(idx.shape[0])[:,None], idx]
IDs = c[idx]

names_vals = ['Max'+str(i+1) for i in range(a.shape[1])]
names_IDs = ['Col_Max'+str(i+1) for i in range(a.shape[1])]

df_vals = pd.DataFrame(vals, columns=names_vals)
df_IDs = pd.DataFrame(IDs, columns=names_IDs)
df_out = pd.concat([df, df_vals, df_IDs], axis=1)

Sample input, output -

In [40]: df
Out[40]: 
    A     B   C    D      E
0  23  12.0   1    9  123.0
1  24  33.0  33  343   33.0
2  55   0.2  66    4    2.2
3  77  44.0  44   64   42.0
4  33  23.5   5   24    2.0
5  66  66.0  62   63   99.0

In [41]: df_out
Out[41]: 
    A     B   C    D      E   Max1  Max2  Max3  Max4  Max5 Col_Max1 Col_Max2  \
0  23  12.0   1    9  123.0  123.0  23.0  12.0   9.0   1.0        E        A   
1  24  33.0  33  343   33.0  343.0  33.0  33.0  33.0  24.0        D        E   
2  55   0.2  66    4    2.2   66.0  55.0   4.0   2.2   0.2        C        A   
3  77  44.0  44   64   42.0   77.0  64.0  44.0  44.0  42.0        A        D   
4  33  23.5   5   24    2.0   33.0  24.0  23.5   5.0   2.0        A        D   
5  66  66.0  62   63   99.0   99.0  66.0  66.0  63.0  62.0        E        B   

  Col_Max3 Col_Max4 Col_Max5  
0        B        D        C  
1        C        B        A  
2        D        E        B  
3        C        B        E  
4        B        C        E  
5        A        D        C  

If you need the values and IDs in sequence, we need to modify the last few steps there -

df0 = pd.DataFrame(np.dstack((vals, IDs)).reshape(a.shape[0],-1))
df0.columns = np.vstack((names_vals, names_IDs)).T.ravel()
df_out = pd.concat([df, df0], axis=1)

Sample output -

In [62]: df_out
Out[62]: 
    A     B   C    D      E Max1 Col_Max1 Max2 Col_Max2  Max3 Col_Max3 Max4  \
0  23  12.0   1    9  123.0  123        E   23        A    12        B    9   
1  24  33.0  33  343   33.0  343        D   33        E    33        C   33   
2  55   0.2  66    4    2.2   66        C   55        A     4        D  2.2   
3  77  44.0  44   64   42.0   77        A   64        D    44        C   44   
4  33  23.5   5   24    2.0   33        A   24        D  23.5        B    5   
5  66  66.0  62   63   99.0   99        E   66        B    66        A   63   

  Col_Max4 Max5 Col_Max5  
0        D    1        C  
1        B   24        A  
2        E  0.2        B  
3        B   42        E  
4        C    2        E  
5        D   62        C  
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • This looks great, let me give it a go. – RaceFrog Aug 25 '17 at 21:32
  • This works perfectly. Thanks a ton. I need to work through all of this logic a few more times to fully understand. Also, thanks for the 2 different cases! – RaceFrog Aug 25 '17 at 22:04
  • Just a side note - this works great to produce a dataframe that has the full order of all the columns, and the locations. Effectively tripling the size of the dataframe. If I just want the top 3(x), I call drop() on df_vals and df_IDs for everything past X number of columns. I think with this method you still need to create the large dataframes at least once to effectively get the sorting and labels. Still great. – RaceFrog Aug 26 '17 at 00:33