
I have a pandas dataframe like the following:

   A  B  C  D
0  7  2  5  2
1  3  3  1  1
2  0  2  6  1
3  3  6  2  9

There can be hundreds of columns; in the example above I have shown only four.

I would like to extract the top-k columns for each row, together with their values.

I can get the top-k columns using:

pd.DataFrame({n: df.T[column].nlargest(k).index.tolist() for n, column in enumerate(df.T)}).T

which, for k=3 gives:

   0  1  2
0  A  C  B
1  A  B  C
2  C  B  D
3  D  B  A

But what I would like to have is:

   0  1  2  3  4  5
0  A  7  C  5  B  2
1  A  3  B  3  C  1
2  C  6  B  2  D  1
3  D  9  B  6  A  3

Is there a pandas-idiomatic way to achieve this?

Abhishek Thakur

3 Answers


You can use a NumPy solution:

import numpy as np

k = 3
vals = df.values
# column indices sorted by descending value, per row
arr1 = np.argsort(-vals, axis=1)

a = df.columns.values[arr1[:, :k]]                         # top-k column names
b = vals[np.arange(len(df.index))[:, None], arr1][:, :k]   # top-k values

# interleave names and values: name, value, name, value, ...
c = np.empty((vals.shape[0], 2 * k), dtype=a.dtype)
c[:, 0::2] = a
c[:, 1::2] = b
print(c)
[['A' 7 'C' 5 'B' 2]
 ['A' 3 'B' 3 'C' 1]
 ['C' 6 'B' 2 'D' 1]
 ['D' 9 'B' 6 'A' 3]]

df = pd.DataFrame(c)
print(df)
   0  1  2  3  4  5
0  A  7  C  5  B  2
1  A  3  B  3  C  1
2  C  6  B  2  D  1
3  D  9  B  6  A  3
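The steps above can be wrapped into a reusable helper (a sketch; `top_k_with_values` is a name chosen here, and `df.columns.values` is used so the 2-D fancy indexing also works on current pandas versions):

```python
import numpy as np
import pandas as pd

def top_k_with_values(df, k):
    """Interleave the top-k column names and their values for each row."""
    vals = df.values
    order = np.argsort(-vals, axis=1)[:, :k]          # column order, descending
    names = df.columns.values[order]                  # top-k names per row
    top = vals[np.arange(len(df))[:, None], order]    # top-k values per row
    out = np.empty((len(df), 2 * k), dtype=object)
    out[:, 0::2] = names
    out[:, 1::2] = top
    return pd.DataFrame(out, index=df.index)

df = pd.DataFrame({'A': [7, 3, 0, 3], 'B': [2, 3, 2, 6],
                   'C': [5, 1, 6, 2], 'D': [2, 1, 1, 9]})
print(top_k_with_values(df, 3))
```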
jezrael
  • this is nonperformant and misses the point of nlargest, which is a partition sort; argsort sorts everything – Jeff Mar 01 '17 at 14:52
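The comment refers to `np.argpartition`, which selects the k largest in linear time without sorting the rest; a minimal sketch of the idea on a single array:

```python
import numpy as np

a = np.array([7, 2, 5, 2, 9, 1, 8, 3])
k = 3

# argpartition moves the k largest into the last k slots in O(n), unordered
idx = np.argpartition(a, -k)[-k:]
# sort only those k entries, largest first
idx = idx[np.argsort(-a[idx])]
print(a[idx])  # -> [9 8 7]
```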
>>> def foo(x):
...     r = []
...     for p in zip(list(x.index), list(x)):
...             r.extend(p)
...     return r
... 
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
   0  1  2  3  4  5
0  A  7  C  5  B  2
1  A  3  B  3  C  1
2  C  6  B  2  D  1
3  D  9  B  6  A  3

Or, using a list comprehension:

>>> def foo(x):
...     return [j for i in zip(list(x.index), list(x)) for j in i]
... 
>>> pd.DataFrame({n: foo(df.T[row].nlargest(k)) for n, row in enumerate(df.T)}).T
   0  1  2  3  4  5
0  A  7  C  5  B  2
1  A  3  B  3  C  1
2  C  6  B  2  D  1
3  D  9  B  6  A  3
Leon
  • good solution but the lists and for loop make it quite slow if the number of rows in dataframe are of the order of 10k+ – Abhishek Thakur Mar 01 '17 at 14:58
  • @AbhishekThakur I added a variant of the same solution using list comprehension, though I don't have any idea about its performance. – Leon Mar 01 '17 at 15:04
  • Ahh, it seems that's not the problem. The problem arises when the function is applied to every row of the pandas dataframe one by one :) – Abhishek Thakur Mar 01 '17 at 15:13
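The same row-wise interleaving can also be written with `DataFrame.apply` (a sketch; `result_type='expand'` needs pandas ≥ 0.23, and since it still visits rows one by one it will not fix the performance concern above):

```python
import pandas as pd

df = pd.DataFrame({'A': [7, 3, 0, 3], 'B': [2, 3, 2, 6],
                   'C': [5, 1, 6, 2], 'D': [2, 1, 1, 9]})
k = 3

def interleave_top_k(row):
    top = row.nlargest(k)
    # alternate names and values: name, value, name, value, ...
    return [item for pair in zip(top.index, top) for item in pair]

result = df.apply(interleave_top_k, axis=1, result_type='expand')
print(result)
```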

This does the job efficiently: it uses argpartition, which finds the k largest in O(n), and then sorts only those k.

import numpy as np
import pandas as pd

values = df.values
n, m = df.shape
k = 3

I, J = np.mgrid[:n, :m]
I = I[:, :1]
if k < m:
    # indices of the k largest per row (unordered), found without a full sort
    J = (-values).argpartition(k)[:, :k]
values = values[I, J]
names = np.take(df.columns.values, J)

# sort only the k selected entries, descending
J2 = (-values).argsort()
names = names[I, J2]
values = values[I, J2]

names_and_values = np.empty((n, 2 * k), object)
names_and_values[:, 0::2] = names
names_and_values[:, 1::2] = values
result = pd.DataFrame(names_and_values)

For the example this gives (tied values, such as A and B in row 1, may come out in a different order than with a stable full sort, since argpartition does not preserve input order):

   0  1  2  3  4  5
0  A  7  C  5  B  2
1  B  3  A  3  C  1
2  C  6  B  2  D  1
3  D  9  B  6  A  3
B. M.