11

For example the data look like:

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}
x=pd.DataFrame(df)
Out[260]: 
    a1  a2  a3  a4  a5  a6
0   5  23   4   1   4   5
1   6  43   2   2   98   43
2   3  56   3   1  23   3
3   2   2   6   3   5   2
4   5   6   7   2   7   5

I need the result to look like:

top1 top2 top3
a2   a1   a6
a5   a2   a6
....

I've seen answer to a previous questions (see below) that recommends idxmax. But how to handle top n values (n>1)?

Find the column name which has the maximum value for each row

Update:

I find the answer very useful but the only thing is that my data is long so have to figure out a way to bypass that. I ended up saving the data to a csv file and then reading it back in in chunks. here is the code I used:

data = pd.read_csv('xxx.csv', chunksize=1000)
rslt = pd.DataFrame(np.zeros((0,3)), columns=['top1','top2','top3'])
for chunk in data:
    x=pd.DataFrame(chunk).T
    for i in x.columns:
        df1row = pd.DataFrame(x.nlargest(3, i).index.tolist(), index=['top1','top2','top3']).T
        rslt = pd.concat([rslt, df1row], axis=0)
rslt=rslt.reset_index(drop=True)
Community
  • 1
  • 1
CWeeks
  • 407
  • 1
  • 5
  • 15

4 Answers4

11
import pandas as pd
import numpy as np

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}
df=pd.DataFrame(df)

df


   a1  a2  a3  a4  a5  a6
0   5  23   4   1   4   5
1   6  43   2   2  98  43
2   3  56   3   1  23   3
3   2   2   6   3   5   2
4   5   6   7   2   7   5

We can solve it using the argsortfrom numpy and apply , lambda from pandas. The solution:

Tops =pd.DataFrame(df.apply(lambda x:list(df.columns[np.array(x).argsort()[::-1][:3]]), axis=1).to_list(),  columns=['Top1', 'Top2', 'Top3'])


Tops

And we get:

  Top1 Top2 Top3
0   a2   a6   a1
1   a5   a6   a2
2   a2   a5   a6
3   a3   a5   a4
4   a5   a3   a2
George Pipis
  • 1,452
  • 16
  • 12
6

What you need is pandas.DataFrame.nlargest.

import pandas as pd
import numpy as np

df={'a1':[5,6,3,2,5],'a2':[23,43,56,2,6], 'a3':[4,2,3,6,7], 'a4':[1,2,1,3,2],'a5':[4,98,23,5,7],'a6':[5,43,3,2,5]}

x=pd.DataFrame(df).T

rslt = pd.DataFrame(np.zeros((0,3)), columns=['top1','top2','top3'])
for i in x.columns:
    df1row = pd.DataFrame(x.nlargest(3, i).index.tolist(), index=['top1','top2','top3']).T
    rslt = pd.concat([rslt, df1row], axis=0)

print rslt

Out[52]: 
  top1 top2 top3
0   a2   a1   a6
0   a5   a2   a6
0   a2   a5   a1
0   a3   a5   a4
0   a3   a5   a2
2342G456DI8
  • 1,819
  • 3
  • 16
  • 29
  • This works great! Only one thing: my actual dataset is quite large (long, over 400K rows) so cant really do transposing. I am trying to read the data into python in chunks to make it work, but just wondering in this case (large data size) is there any other way to handle it? Thx – CWeeks May 28 '16 at 19:11
  • @CWlearner, in the case of very large data-set, the chunks solution is all I can think of if you would like to done in pure python pandas approach, more infor you can refer to [“Large data” work flows using pandas](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas). But why don't you consider of using SQL instead, that would be much more easier. – 2342G456DI8 May 29 '16 at 05:48
  • how would you do this in SQL? @2342G456DI8 – Hana Feb 23 '22 at 20:41
1

You can do it like this:

x.T.apply(lambda x: x.sort_values(ascending=False).index).T.filter(['a1','a2','a3']).rename(columns={"a1":'top1',"a2":'top2',"a3":'top3'})

Results:

  top1 top2 top3
0   a2  a6  a1
1   a5  a6  a2
2   a2  a5  a6
3   a3  a5  a4
4   a5  a3  a2
Billy Bonaros
  • 1,671
  • 11
  • 18
0

an answer without the need to transpose your data.

first we define how to get the n largest value index from a list (or a row).

def get_n_largest_ind(a,n):
    ind = np.argpartition(a, -n)[-n:]
    return ind[0]

a = [0.4,0.5,0.4,0.8,0.2,0.1,0.99]
get_n_largest_ind(a,3)
>>>> 1

Now you can define which columns to use for the ranking. Then we loop over the number of top columns you want to show (in this example 3).

cols = df.columns
for n in [1,2,3]:
    df["top{}".format(n)] = df[cols].apply(lambda x: cols[get_n_largest_ind(x,n)],axis=1)
Rick
  • 2,080
  • 14
  • 27