Pandas max for rows, top n max

Question

I'm trying to create "top" columns that hold the largest values across several columns of each row. Pandas has an nlargest method, but I cannot get it to work row-wise. Pandas also has max and idxmax, which do exactly what I want, but only for the single largest value.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 5, 1, 9], [4, 5, 6, 2, 5, 9], [7, 8, 9, 2, 5, 10]]), columns=['a', 'b', 'c', 'd', 'e', 'f'])
cols = df.columns[:-1].tolist()

df['max_1_val'] = df[cols].max(axis=1)
df['max_1_col'] = df[cols].idxmax(axis=1)

Output:

    a   b   c   d   e   f   max_1_val   max_1_col
0   1   2   3   5   1   9   5           d
1   4   5   6   2   5   9   6           c
2   7   8   9   2   5   10  9           c

But I am trying to get max_n_val and max_n_col so the expected output for top 3 would be:

    a   b   c   d   e   f   max_1_val   max_1_col   max_2_val   max_2_col   max_3_val   max_3_col
0   1   2   3   5   1   9   5           d           3           c           2           b
1   4   5   6   2   5   9   6           c           5           b           5           e
2   7   8   9   2   5   10  9           c           8           b           7           a

I'm not posting as an answer as it's not complete. But this might get you started: `df[['max_1', 'max_2', 'max_3']] = df.T.nlargest(3, columns=[0]).T`. Basically, you're transposing the original frame to calc `nlargest` and storing to new columns, then transposing back into its original form. — S3DEV, Mar 11 '20 at 09:54
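
A minimal row-wise sketch of the nlargest idea from the comment above, which as written ranks the columns by row 0 only. It applies Series.nlargest to each row separately; the helper name top_n_cols is hypothetical, and a row-wise apply is generally slower than a NumPy-based approach on large frames.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3, 5, 1, 9], [4, 5, 6, 2, 5, 9], [7, 8, 9, 2, 5, 10]]),
    columns=['a', 'b', 'c', 'd', 'e', 'f'],
)
cols = df.columns[:-1].tolist()

def top_n_cols(row, n=3):
    # hypothetical helper: names of the n largest entries in one row, largest first
    return pd.Series(row.nlargest(n).index)

# one output row per input row, one output column per rank (0 = largest)
top_cols = df[cols].apply(top_n_cols, axis=1)
print(top_cols)
```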

jezrael · Accepted Answer · 2020-03-11T10:07:36.450

To improve performance, use numpy.argsort to get the column positions. For descending order, take the last 3 positions of each row and reverse them by slicing:

N = 3
a = df[cols].to_numpy().argsort()[:, :-N-1:-1]
print (a)
[[3 2 1]
 [2 4 1]
 [2 1 0]]
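
The slice `[:, :-N-1:-1]` walks each row of the argsort result backwards from the last position and stops after N items, so the positions come out largest-first. A small sketch on the first row of the example (the variable names are illustrative):

```python
import numpy as np

N = 3
row = np.array([[1, 2, 3, 5, 1]])

order = row.argsort()        # positions sorted by ascending value
top = order[:, :-N-1:-1]     # last N positions, reversed: largest first
print(top)                   # [[3 2 1]]
```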

Then get the column names by indexing an array of cols with a (stored in c), and reorder the values row by row with the same positions (stored in d):

c = np.array(cols)[a]
d = df[cols].to_numpy()[np.arange(a.shape[0])[:, None], a]
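
The index pair `np.arange(a.shape[0])[:, None], a` broadcasts a column of row numbers against a, so each row of values is reordered by its own positions. Assuming NumPy 1.15 or later, np.take_along_axis expresses the same lookup; this is a sketch for comparison, not part of the original answer:

```python
import numpy as np

values = np.array([[1, 2, 3, 5, 1], [4, 5, 6, 2, 5], [7, 8, 9, 2, 5]])
a = np.array([[3, 2, 1], [2, 4, 1], [2, 1, 0]])   # positions from the argsort step

rows = np.arange(a.shape[0])[:, None]              # shape (3, 1), broadcasts against a
d_fancy = values[rows, a]                          # indexing form used in the answer
d_take = np.take_along_axis(values, a, axis=1)     # equivalent lookup along axis 1

print(d_take)
```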

Finally, create DataFrames from c and d, join them to the original with concat, and order the columns with DataFrame.reindex:

df1 = pd.DataFrame(c).rename(columns=lambda x : f'max_{x+1}_col')
df2 = pd.DataFrame(d).rename(columns=lambda x : f'max_{x+1}_val')

c = df.columns.tolist() + [y for x in zip(df2.columns, df1.columns) for y in x]

df = pd.concat([df, df1, df2], axis=1).reindex(c, axis=1)
print (df)
   a  b  c  d  e   f  max_1_val max_1_col  max_2_val max_2_col  max_3_val  \
0  1  2  3  5  1   9          5         d          3         c          2   
1  4  5  6  2  5   9          6         c          5         e          5   
2  7  8  9  2  5  10          9         c          8         b          7   

  max_3_col  
0         b  
1         b  
2         a
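
Note that in row 1 the tied value 5 appears as e before b, whereas the question expected b before e; reversing an ascending argsort can flip the order of ties. If ties should follow column order, one option is a stable sort on negated values. A sketch under that assumption (the function name add_top_n is hypothetical, and negation presumes signed numeric data):

```python
import numpy as np
import pandas as pd

def add_top_n(df, cols, n=3):
    # hypothetical helper, not from the original answer
    values = df[cols].to_numpy()   # pandas >= 0.24; use .values on older versions
    # stable sort of negated values: descending order, ties keep column order
    pos = np.argsort(-values, axis=1, kind='stable')[:, :n]
    names = np.array(cols)[pos]
    vals = np.take_along_axis(values, pos, axis=1)
    out = df.copy()
    for i in range(n):
        out[f'max_{i+1}_val'] = vals[:, i]
        out[f'max_{i+1}_col'] = names[:, i]
    return out

df = pd.DataFrame(
    np.array([[1, 2, 3, 5, 1, 9], [4, 5, 6, 2, 5, 9], [7, 8, 9, 2, 5, 10]]),
    columns=['a', 'b', 'c', 'd', 'e', 'f'],
)
cols = df.columns[:-1].tolist()

result = add_top_n(df, cols, n=3)
# the tie in row 1 is resolved in column order: b, then e
print(result.loc[1, ['max_2_col', 'max_3_col']].tolist())
```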

edited Mar 11 '20 at 10:07

answered Mar 11 '20 at 09:54

jezrael


I'm getting "AttributeError: 'DataFrame' object has no attribute 'to_numpy'" but I updated both numpy and pandas – destinychoice Mar 11 '20 at 10:03
@destinychoice - Then is it possible to change `.to_numpy()` to `.values`, with no `()`? – jezrael Mar 11 '20 at 10:08
yes, changing `to_numpy()` to `.values` works fine, nice one – destinychoice Mar 11 '20 at 10:10
must be something with my env because `to_numpy()` works fine in the python shell – destinychoice Mar 11 '20 at 10:13
