
I can find the n largest values in each row of a numpy array (link), but doing so loses the column information, which is exactly what I want to keep. Say I have some data:

import pandas as pd
import numpy as np

np.random.seed(42)
data = np.random.rand(5,5)
data = pd.DataFrame(data, columns = list('abcde'))
data

          a         b         c         d         e
0  0.374540  0.950714  0.731994  0.598658  0.156019
1  0.155995  0.058084  0.866176  0.601115  0.708073
2  0.020584  0.969910  0.832443  0.212339  0.181825
3  0.183405  0.304242  0.524756  0.431945  0.291229
4  0.611853  0.139494  0.292145  0.366362  0.456070

I want the names of the largest contributors in each row. So for n = 2, the output would be:

0  b  c
1  c  e
2  b  c
3  c  d
4  a  e

I can do this by looping over the DataFrame, but that would be inefficient. Is there a more Pythonic way?

R Walser

3 Answers


With the pandas.Series.nlargest function:

# for each row, take the index (column) labels of the 2 largest values
data.apply(lambda row: row.nlargest(2).index.values, axis=1)

0    [b, c]
1    [c, e]
2    [b, c]
3    [c, d]
4    [a, e]
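
If a DataFrame like the desired output in the question is preferred over a Series of arrays, the result can be expanded into one column per rank, for example (a sketch reusing data from the question; the integer column labels are incidental):

import pandas as pd

# expand the Series of arrays into one column per rank position
top2 = data.apply(lambda row: row.nlargest(2).index.values, axis=1)
pd.DataFrame(top2.tolist(), index=data.index)

   0  1
0  b  c
1  c  e
2  b  c
3  c  d
4  a  e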
RomanPerekhrest

Another option: use numpy.argpartition to find the top-n indices per row, then extract the column names by index:

import numpy as np

n = 2
# indices of the n largest values per row (order within the top n is not guaranteed)
nlargest_index = np.argpartition(data.values, data.shape[1] - n)[:, -n:]
data.columns.values[nlargest_index]

#array([['c', 'b'],
#       ['e', 'c'],
#       ['c', 'b'],
#       ['d', 'c'],
#       ['e', 'a']], dtype=object)
Psidom
  • Your solution has the advantage that it gives the indices as well. How can I use this array of indices to extract the corresponding value from the original dataframe? – R Walser Feb 11 '23 at 09:48
  • You can use [advanced indexing](https://www.tutorialspoint.com/numpy/numpy_advanced_indexing.htm): `data.values[np.arange(len(data))[:,None], nlargest_index]` – Psidom Feb 11 '23 at 09:56
  • The code works but transferring it to the original data it seems that it only works if there are no nan's in the data. – R Walser Feb 11 '23 at 10:06
  • Order is not assured with argpartition, the nlargest solution is relatively slower but ensures order – sammywemmy Feb 11 '23 at 10:09
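
Putting the points from these comments together, a sketch that retrieves both labels and values, keeps NaNs out of the top n, and fixes the descending order might look like this (assuming each row has at least n non-NaN entries):

import numpy as np

n = 2
vals = data.values
# treat NaNs as -inf so argpartition never selects them as "largest"
safe = np.where(np.isnan(vals), -np.inf, vals)
# top-n column indices per row (order within the top n not guaranteed)
idx = np.argpartition(safe, -n, axis=1)[:, -n:]
rows = np.arange(len(data))[:, None]
# reorder the top-n indices by value, descending
order = np.argsort(safe[rows, idx], axis=1)[:, ::-1]
idx = idx[rows, order]
names = data.columns.values[idx]  # column labels of the top n
values = vals[rows, idx]          # the corresponding values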

Can a dense ranking be used for this?

N = 2
threshold = len(data.columns) - N
# keep only the values whose rank within the row is in the top N
nlargest = data[data.rank(method="dense", axis=1) > threshold]
>>> nlargest
          a         b         c         d         e
0       NaN  0.950714  0.731994       NaN       NaN
1       NaN       NaN  0.866176       NaN  0.708073
2       NaN  0.969910  0.832443       NaN       NaN
3       NaN       NaN  0.524756  0.431945       NaN
4  0.611853       NaN       NaN       NaN  0.456070
>>> nlargest.stack()
0  b    0.950714
   c    0.731994
1  c    0.866176
   e    0.708073
2  b    0.969910
   c    0.832443
3  c    0.524756
   d    0.431945
4  a    0.611853
   e    0.456070
dtype: float64
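
If only the column names are needed, the stacked result can be reduced to lists of labels per row, for example (a sketch; note that method="dense" keeps all tied values, so a row containing ties may yield more than N names):

# group the column labels (the second index level) by row
nlargest.stack().reset_index(level=1)["level_1"].groupby(level=0).agg(list)

0    [b, c]
1    [c, e]
2    [b, c]
3    [c, d]
4    [a, e]
Name: level_1, dtype: object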
jqurious