
I have a pandas DataFrame like:

        column0     column1     column2     column3     column4
row0    179319.0    180895.0    94962.0     130734.0    0
row1    89659.5     90447.5     47481.0     65367.0     0
row2    59773.0     60298.33333 31654.0     43578.0     0
row3    44829.75    45223.75    23740.5     32683.5     0
row4    35863.8     36179.0     18992.4     26146.8     0
row5    29886.5     30149.16666 15827.0     21789.0     0
row6    25617.0     25842.14285 13566.0     18676.28571 0
row7    22414.875   22611.875   11870.25    16341.75    0
row8    19924.33333 20099.44444 10551.33333 14526.0     0

and I would like to get the indices of the 9 (the number of rows) highest values in the DataFrame, or a count of how many of those highest values fall in each column, like:

column0  column1  column2  column3  column4
3        3        1        2        0

In my example the 9 highest values would be the ones from column0, column1, column2, and column3 from row0, the ones from column0, column1, and column3 from row1, and the ones from column0 and column1 from row2.

Any ideas? Thanks!

Oscar
  • I just edited my question to make it more clear. – Oscar Apr 19 '19 at 19:06
  • And what about if there is a tie for the 9th largest value between two cells? Would you want just any one of the indices, or is it useful to return all ties for the 9th (effectively allowing you to obtain more than 9 counts)? – ALollz Apr 19 '19 at 19:39
  • I didn't think about that, but I would say that both solutions are good for my purpose (random one or both of them). – Oscar Apr 19 '19 at 19:51

4 Answers


IIUC, use nlargest after stack:

df.stack().nlargest(9).groupby(level=1).count().reindex(df.columns,fill_value=0)
Out[48]: 
column0    3
column1    3
column2    1
column3    2
column4    0
dtype: int64
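A note on the tie case raised in the comments: `nlargest` has a `keep` parameter, and `keep='all'` retains every value tied with the smallest kept one, so the counts can sum to more than 9. A minimal sketch on a made-up frame (not the asker's data):

```python
import pandas as pd

# toy frame with a tie at the cutoff value
df = pd.DataFrame({'a': [5, 4, 4], 'b': [4, 1, 1]})

# keep='all' retains every value tied with the smallest kept one,
# so the result may be longer than n
top = df.stack().nlargest(3, keep='all')
counts = top.groupby(level=1).count().reindex(df.columns, fill_value=0)
print(counts)  # a: 3, b: 1 -- four values kept instead of three
```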
BENY

IIUC, you can stack and sort:

df.stack().sort_values(ascending = False).iloc[:9]

row0  column1    180895.00000
      column0    179319.00000
      column3    130734.00000
      column2     94962.00000
row1  column1     90447.50000
      column0     89659.50000
      column3     65367.00000
row2  column1     60298.33333
      column0     59773.00000
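To go from this sorted view to the per-column counts the question asks for, one can count the column labels, which sit in level 1 of the stacked MultiIndex; a small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'column0': [9.0, 8.0],
                   'column1': [7.0, 1.0],
                   'column2': [2.0, 3.0]})

top = df.stack().sort_values(ascending=False).iloc[:3]
# level 1 of the stacked MultiIndex holds the column labels
counts = (top.index.get_level_values(1)
             .value_counts()
             .reindex(df.columns, fill_value=0))
print(counts)  # column0: 2, column1: 1, column2: 0
```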
Vaishali

pandas has a method called nlargest that returns the n largest values of a column as a Series (see the docs).

If you want just the index of each, you can use .index.values, or .index.array as recommended in the docs.

If you want a count of all of the highest values, check out this answer here.
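For example, on a single column (a hypothetical `df` with a numeric `column0`; the names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'column0': [10.0, 30.0, 20.0]})

# positions of the 2 largest values in one column
idx = df['column0'].nlargest(2).index.values
print(idx)  # [1 2]
```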

MyNameIsCaleb
  • My problem is that I want the highest values of the dataframe, not just a list. And I need to identify where all these values are (row and column). – Oscar Apr 19 '19 at 19:07
  • Got it, I'll leave my answer as reference to some of the docs. Wen-Ben and Vaishali have usable answers for what you clarified. – MyNameIsCaleb Apr 19 '19 at 19:13

Here's a solution using NumPy + collections.Counter, which should be fairly fast at building the Counter object.

from collections import Counter
import numpy as np
import pandas as pd

c = Counter(np.tile(df.columns, df.shape[0])[np.argsort(df.to_numpy().ravel())[-9:]])
#Counter({'column0': 3, 'column1': 3, 'column2': 1, 'column3': 2})

s = pd.Series(c).reindex(df.columns).fillna(0).astype(int)
#column0    3
#column1    3
#column2    1
#column3    2
#column4    0
#dtype: int32
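If the full sort is a concern, `np.argpartition` can replace `np.argsort`: it finds the positions of the k largest values in linear time without fully ordering them. A sketch of that variant on toy data (same idea, illustrative names):

```python
from collections import Counter
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [9.0, 1.0], 'b': [8.0, 2.0], 'c': [0.0, 7.0]})

k = 3
flat = df.to_numpy().ravel()
# argpartition puts the indices of the k largest values in the
# last k slots, in no particular order
top_idx = np.argpartition(flat, -k)[-k:]
c = Counter(np.tile(df.columns, df.shape[0])[top_idx])
s = pd.Series(c).reindex(df.columns).fillna(0).astype(int)
print(s)  # a: 1, b: 1, c: 1
```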
ALollz