How to identify row and column based on top-K values in Pandas data frame

Question

I have a data frame created this way:

import pandas as pd
d = {'gene' : ['foo', 'qux', 'bar', 'bin'],
     'one' : [1., 2., 3., 1.],
     'two' : [4., 3., 2., 1.],
     'three' : [1., 2., 20., 1.],
     }

df = pd.DataFrame(d)

# # List top 5 values
# ndf = df[['one','two','three']]
# top = ndf.values.flatten().tolist()
# top.sort(reverse=True)
# top[0:5]
# [20.0, 4.0, 3.0, 3.0, 2.0]

It looks like this:

In [58]: df
Out[58]:
  gene  one  three  two
0  foo    1      1    4
1  qux    2      2    3
2  bar    3     20    2
3  bin    1      1    1

What I want to do is to collapse all values in 2nd column onwards. Get the top 5 scores and identify the corresponding row/column of that selected rows:

Then the desired dictionary will look like this:

{'foo':['two'],
'qux':['one','two','three'],
'bar':['one','two','three']}

How can I achieve that?

score 4 · Accepted Answer · answered Mar 01 '16 at 12:05

You can stack the dataframe, then get the largest 5 values (I used rank because it seems like you want to include all ties), and then group by gene to get the dictionary.

In [2]: df_stack = df.set_index('gene').stack()

In [3]: df_top = df_stack.loc[df_stack.rank('min', ascending=False) <= 5]

In [4]: print df_top.reset_index(0).groupby('gene').groups
{'qux': ['one', 'three', 'two'], 'foo': ['two'], 'bar': ['one', 'three', 'two']}

tworec · Answer 2 · 2016-03-01T19:43:54.767

1

Here's the working but not clean pandas solution.

top5=top[0:5]
dt=df.set_index('gene').T
d={}
for col in dt.columns:
    idx_list=dt[col][dt[col].isin(top5)].index.tolist()
    if idx_list:
        d[col]=idx_list 
d

will return

{'bar': ['one', 'three', 'two'],
 'foo': ['two'],
 'qux': ['one', 'three', 'two']}

edited Mar 01 '16 at 19:43

answered Mar 01 '16 at 09:27

tworec

4,409
2
29
34

score 1 · Answer 3 · answered Mar 01 '16 at 09:27

# Get n'th largest unique value from dataframe.
n = 5
threshold = pd.Series([col for row in df.iloc[:, 1:].values 
                       for col in row]).nlargest(n).iat[-1]

d = {}
for g, row in df.iloc[:, 1:].iterrows():
    vals = row[row.ge(threshold)].index.tolist()
    if vals:
        d[df.gene.iat[g]] = vals

>>> d
{'bar': ['one', 'three', 'two'],
 'foo': ['two'],
 'qux': ['one', 'three', 'two']}

score 0 · Answer 4 · edited May 23 '17 at 11:53

Before starting, I set the gene column as index. This makes it easier to isolate numeric columns (like you did with ndf), and easier to return a dictionary later on:

df.set_index('gene', inplace=True)

I then proceed in two steps.

First, get the 5th largest value via numpy, in the spirit of this answer:

import numpy as np
a = df.as_matrix().flatten()
n_max = -np.partition(-a, 5)[4]

Using partition avoids sorting the entire array (like you did with top), which could be costly when the array is large.

Second, apply a lambda function to retrieve the column names:

df.apply(lambda row: row.index[row >= n_max].tolist(), axis=1).to_dict()

Note how, since each row is a Series, a row's indices are the dataframe's columns. Result:

{'bar': ['one', 'three', 'two'],
 'bin': [],
 'foo': ['two'],
 'qux': ['one', 'three', 'two']}

How to identify row and column based on top-K values in Pandas data frame

4 Answers4