
I have a dataframe like this one:

df = pd.DataFrame({'a':[1,2,1],'b':[4,6,0],'c':[0,4,8]})
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 4 | 0 |
+---+---+---+
| 2 | 6 | 4 |
+---+---+---+
| 1 | 0 | 8 |
+---+---+---+

For each row, I need both the n highest values (here n = 2) and their corresponding column names, in descending order of value:

row 1: 'b':4,'a':1
row 2: 'b':6,'c':4
row 3: 'c':8,'a':1
smci
Luis Ramon Ramirez Rodriguez
  • Is it guaranteed that there are exactly three columns and their names are `a,b,c` or do you want a general answer? – smci Nov 05 '16 at 01:22
  • I want a general answer, just used three columns for the sake of simplicity – Luis Ramon Ramirez Rodriguez Nov 05 '16 at 01:23
  • Possible duplicate of [Selecting top N columns for each row in data frame](http://stackoverflow.com/questions/34297319/selecting-top-n-columns-for-each-row-in-data-frame) – smci Nov 05 '16 at 01:26
  • Actually it's a perfect duplicate of [Find top-n highest-value columns in each pandas dataframe row](https://stackoverflow.com/questions/38955182/find-top-n-highest-value-columns-in-each-pandas-dataframe-row) with `nlargest = 2`. There's your answer. (I can't redirect my close vote now.) – smci Nov 05 '16 at 01:33
  • @smci thanks, but I'm not sure this is the same. I need the correspondence between the values and their columns: I need to know which column each of the top values originally came from. – Luis Ramon Ramirez Rodriguez Nov 05 '16 at 01:42
  • I should have renamed it [Find names of top-n highest-value columns... in row](https://stackoverflow.com/questions/38955182/find-names-of-top-n-highest-value-columns-in-each-pandas-dataframe-row) – smci Nov 05 '16 at 01:44
  • Ok I see now you wrote "for each row, I need (EDIT: **both**) the top-n values and the corresponding column in descending order". Yeah sorry, that's slightly different. – smci Nov 05 '16 at 01:46
  • By the way, your output rownames are the not-so-Pythonic 1,2,3 instead of 0,1,2 – smci Nov 05 '16 at 05:10
  • A near-duplicate, except with floats, and using column indices instead of names: [For each dataframe row, get both the top-n values and the column-indices where they occur](http://stackoverflow.com/questions/36518092/for-each-dataframe-row-get-both-the-top-n-values-and-the-column-indices-where-t) – smci Nov 05 '16 at 05:22

1 Answer


Here are two ways, both adapted from @unutbu's answer to "Find names of top-n highest-value columns in each pandas dataframe row".

1) Use Python's Decorate-Sort-Undecorate idiom with a .apply(...) on each row: pair each value with its column name, sort the pairs by value in descending order, keep the top n, and reformat the answer. (I think this is cleaner.)

import numpy as np
import pandas as pd

# Apply Decorate-Sort row-wise to our df, and slice the top-n columns within each row...

def sort_decr2_topn(row, nlargest=2):
    # pair each value with its column name, sort by value (descending), keep the top n
    return sorted(zip(row.index, row), key=lambda cv: -cv[1])[:nlargest]

tmp = df.apply(sort_decr2_topn, axis=1)

0    [(b, 4), (a, 1)]
1    [(b, 6), (c, 4)]
2    [(c, 8), (a, 1)]

# then your result (as a NumPy object array) is...
np.array(tmp)
array([[('b', 4), ('a', 1)],
       [('b', 6), ('c', 4)],
       [('c', 8), ('a', 1)]], dtype=object)
# ... or as a list of rows is
tmp.values.tolist()
#... and you can insert the row-indices 0,1,2 with 
list(zip(tmp.index, tmp.values.tolist()))
[(0, [('b', 4), ('a', 1)]), (1, [('b', 6), ('c', 4)]), (2, [('c', 8), ('a', 1)])]
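
As a hedged aside, the same per-row (column, value) pairs can probably be obtained with pandas' own Series.nlargest instead of a hand-written sort. This is a swapped-in alternative, not the approach from @unutbu's answer, and the helper name top_n_pairs is made up for illustration. Ties are resolved by nlargest's default keep='first'.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [4, 6, 0], 'c': [0, 4, 8]})

def top_n_pairs(row, n=2):
    # hypothetical helper: (column, value) pairs for the n largest values in this row,
    # largest first; Series.nlargest keeps the column labels as the index
    return list(row.nlargest(n).items())

# axis=1 passes each row to the helper as a Series indexed by column name
top_pairs = df.apply(top_n_pairs, axis=1)
print(top_pairs.tolist())
```

Each element of top_pairs is a list of n tuples, so it can be post-processed the same way as tmp above.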

2) Compute the matrix topnlocs of top-n column positions as follows, then use it to index into both df.columns and df.values, and combine the two outputs:

import numpy as np

nlargest = 2
topnlocs = np.argsort(-df.values, axis=1)[:, 0:nlargest]
# ... now you can use topnlocs to index into both df.columns and df.values, then reformat/combine them somehow
# however, it's painful trying to apply that NumPy array of indices back to df or df.values.

See How to get away with a multidimensional index in pandas
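
One way to finish option 2 without .ix, sketched under the assumption that NumPy 1.15 or later is available (for np.take_along_axis): index the column labels with topnlocs directly and gather the matching values along each row. The names top_names, top_values and result are illustrative only, and kind="stable" is an added choice so that ties keep column order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [4, 6, 0], 'c': [0, 4, 8]})
nlargest = 2

values = df.to_numpy()
# column positions of the n largest values in each row, largest first
topnlocs = np.argsort(-values, axis=1, kind="stable")[:, :nlargest]

# fancy-index the column labels, and gather the matching values row by row
top_names = df.columns.to_numpy()[topnlocs]
top_values = np.take_along_axis(values, topnlocs, axis=1)

# combine into one list of (column, value) pairs per row
result = [list(zip(names, vals))
          for names, vals in zip(top_names.tolist(), top_values.tolist())]
print(result)
```

Negating the values to get a descending sort works for numeric columns; non-numeric data would need a different ordering step.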

smci
  • So I went the extra five miles and gave you working code. It was quite painful. Option 1) sounds less pandas-thonic, but works better. – smci Nov 05 '16 at 05:10
  • This is exactly what I want, except 1) `ix` is deprecated, 2) changing that line to use `iloc` produces `Too many indexers`. I'm using pandas 0.25.3. – dfrankow Nov 14 '19 at 17:48
  • I'm getting syntax error in sort_decr2_topn function – Abdul Haseeb Jan 24 '22 at 19:30