Find the column names which have top 3 largest values for each row

Question

For example the data look like:

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}
x=pd.DataFrame(df)
Out[260]: 
    a1  a2  a3  a4  a5  a6
0   5  23   4   1   4   5
1   6  43   2   2   98   43
2   3  56   3   1  23   3
3   2   2   6   3   5   2
4   5   6   7   2   7   5

I need the result to look like:

top1 top2 top3
a2   a1   a6
a5   a2   a6
....

I've seen answer to a previous questions (see below) that recommends idxmax. But how to handle top n values (n>1)?

Find the column name which has the maximum value for each row

Update:

I find the answer very useful but the only thing is that my data is long so have to figure out a way to bypass that. I ended up saving the data to a csv file and then reading it back in in chunks. here is the code I used:

data = pd.read_csv('xxx.csv', chunksize=1000)
rslt = pd.DataFrame(np.zeros((0,3)), columns=['top1','top2','top3'])
for chunk in data:
    x=pd.DataFrame(chunk).T
    for i in x.columns:
        df1row = pd.DataFrame(x.nlargest(3, i).index.tolist(), index=['top1','top2','top3']).T
        rslt = pd.concat([rslt, df1row], axis=0)
rslt=rslt.reset_index(drop=True)

score 11 · Answer 1 · answered Jun 17 '20 at 10:39

import pandas as pd
import numpy as np

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}
df=pd.DataFrame(df)

df


   a1  a2  a3  a4  a5  a6
0   5  23   4   1   4   5
1   6  43   2   2  98  43
2   3  56   3   1  23   3
3   2   2   6   3   5   2
4   5   6   7   2   7   5

We can solve it using the argsortfrom numpy and apply , lambda from pandas. The solution:

Tops =pd.DataFrame(df.apply(lambda x:list(df.columns[np.array(x).argsort()[::-1][:3]]), axis=1).to_list(),  columns=['Top1', 'Top2', 'Top3'])


Tops

And we get:

  Top1 Top2 Top3
0   a2   a6   a1
1   a5   a6   a2
2   a2   a5   a6
3   a3   a5   a4
4   a5   a3   a2

great answer!! this helped my use case a ton – Hana Feb 23 '22 at 20:44 — Hana, Feb 23 '22 at 20:44

2342G456DI8 · Accepted Answer · 2016-05-28T05:32:52.350

6

What you need is pandas.DataFrame.nlargest.

import pandas as pd
import numpy as np

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}

x=pd.DataFrame(df).T

rslt = pd.DataFrame(np.zeros((0,3)), columns=['top1','top2','top3'])
for i in x.columns:
    df1row = pd.DataFrame(x.nlargest(3, i).index.tolist(), index=['top1','top2','top3']).T
    rslt = pd.concat([rslt, df1row], axis=0)

print rslt

Out[52]: 
  top1 top2 top3
0   a2   a1   a6
0   a5   a2   a6
0   a2   a5   a1
0   a3   a5   a4
0   a3   a5   a2

edited May 28 '16 at 05:32

answered May 28 '16 at 05:27

2342G456DI8

1,819
3
16
29

This works great! Only one thing: my actual dataset is quite large (long, over 400K rows) so cant really do transposing. I am trying to read the data into python in chunks to make it work, but just wondering in this case (large data size) is there any other way to handle it? Thx – CWeeks May 28 '16 at 19:11
@CWlearner, in the case of very large data-set, the chunks solution is all I can think of if you would like to done in pure python pandas approach, more infor you can refer to [“Large data” work flows using pandas](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas). But why don't you consider of using SQL instead, that would be much more easier. – 2342G456DI8 May 29 '16 at 05:48
how would you do this in SQL? @2342G456DI8 – Hana Feb 23 '22 at 20:41

Billy Bonaros · Answer 3 · 2020-06-17T11:33:04.513

1

You can do it like this:

x.T.apply(lambda x: x.sort_values(ascending=False).index).T.filter(['a1','a2','a3']).rename(columns={"a1":'top1',"a2":'top2',"a3":'top3'})

Results:

  top1 top2 top3
0   a2  a6  a1
1   a5  a6  a2
2   a2  a5  a6
3   a3  a5  a4
4   a5  a3  a2

edited Jun 17 '20 at 11:33

answered Jun 17 '20 at 11:19

Billy Bonaros

1,671
11
18

score 0 · Answer 4 · answered Jun 29 '18 at 13:19

an answer without the need to transpose your data.

first we define how to get the n largest value index from a list (or a row).

def get_n_largest_ind(a,n):
    ind = np.argpartition(a, -n)[-n:]
    return ind[0]

a = [0.4,0.5,0.4,0.8,0.2,0.1,0.99]
get_n_largest_ind(a,3)
>>>> 1

Now you can define which columns to use for the ranking. Then we loop over the number of top columns you want to show (in this example 3).

cols = df.columns
for n in [1,2,3]:
    df["top{}".format(n)] = df[cols].apply(lambda x: cols[get_n_largest_ind(x,n)],axis=1)

Find the column names which have top 3 largest values for each row

4 Answers4

Linked

Related