8

given a data frame with one descriptive column and X numeric columns, for each row I'd like to identify the top N columns with the higher values and save it as rows on a new dataframe.

For example, consider the following data frame:

df = pd.DataFrame()
df['index'] = ['A', 'B', 'C', 'D','E', 'F']
df['option1'] = [1,5,3,7,9,3]
df['option2'] = [8,4,5,6,9,2]
df['option3'] = [9,9,1,3,9,5]
df['option4'] = [3,8,3,5,7,0]
df['option5'] = [2,3,4,9,4,2]

enter image description here

I'd like to output (lets say N is 3, so I want the top 3):

A,option3
A,option2
A,option4

B,option3
B,option4
B,option1

C,option2
C,option5
C,option4 (or option1 - ties arent really a problem)

D,option5
D,option1
D,option2

and so on....

any idea how that can be easily achieved? Thanks

smci
  • 32,567
  • 20
  • 113
  • 146
Diego
  • 34,802
  • 21
  • 91
  • 134
  • 2
    what kind of format do you want? – Padraic Cunningham Dec 15 '15 at 19:38
  • Since the OP never replied, let's make the reasonable assumption they want a dataframe, not a list-of-lists or whatever else. – smci Nov 05 '16 at 01:28
  • Retitled since the OP apparently wants *"Finding top N columns"* instead of *"Selecting top N columns..."*, which would be a pandas operation with df output. – smci Nov 05 '16 at 01:38

5 Answers5

3

Let's assume

N = 3

First of all I will create matrix of input fields and for each field remember what was original option for this cell:

matrix = [[(j, 'option' + str(i)) for j in df['option' + str(i)]] for i in range(1,6)]

The result of this line will be:

[
 [(1, 'option1'), (5, 'option1'), (3, 'option1'), (7, 'option1'), (9, 'option1'), (3, 'option1')],
 [(8, 'option2'), (4, 'option2'), (5, 'option2'), (6, 'option2'), (9, 'option2'), (2, 'option2')],
 [(9, 'option3'), (9, 'option3'), (1, 'option3'), (3, 'option3'), (9, 'option3'), (5, 'option3')],
 [(3, 'option4'), (8, 'option4'), (3, 'option4'), (5, 'option4'), (7, 'option4'), (0, 'option4')],
 [(2, 'option5'), (3, 'option5'), (4, 'option5'), (9, 'option5'), (4, 'option5'), (2, 'option5')]
]

Then we can easly transform matrix using zip function, sort result rows by first element of tuple and take N first items:

transformed = [sorted(l, key=lambda x: x[0], reverse=True)[:N] for l in zip(*matrix)]

List transformed will look like:

[
 [(9, 'option3'), (8, 'option2'), (3, 'option4')],
 [(9, 'option3'), (8, 'option4'), (5, 'option1')],
 [(5, 'option2'), (4, 'option5'), (3, 'option1')],
 [(9, 'option5'), (7, 'option1'), (6, 'option2')],
 [(9, 'option1'), (9, 'option2'), (9, 'option3')],
 [(5, 'option3'), (3, 'option1'), (2, 'option2')]
]

The last step will be joining column index and result tuple by:

for id, top in zip(df['index'], transformed):
    for option in top:
        print id + ',' + option[1]
    print ''
kubked
  • 137
  • 5
  • thats interesting solution but it relies on pre-defined column naming. I used option1, option2,... ofr simplicity, the names don't follow a logic and can be different depending on the situation. But thanks for helping – Diego Dec 17 '15 at 12:23
3

If you just want pairings:

from operator import itemgetter as it
from itertools import repeat
n = 3

 # sort_values = order pandas < 0.17
new_d = (zip(repeat(row["index"]), map(it(0),(row[1:].sort_values(ascending=0)[:n].iteritems())))
                 for _, row in df.iterrows())
for row in new_d:
    print(list(row))

Output:

[('B', 'option3'), ('B', 'option4'), ('B', 'option1')]
[('C', 'option2'), ('C', 'option5'), ('C', 'option1')]
[('D', 'option5'), ('D', 'option1'), ('D', 'option2')]
[('E', 'option1'), ('E', 'option2'), ('E', 'option3')]
[('F', 'option3'), ('F', 'option1'), ('F', 'option2')]

Which also maintains the order.

If you want a list of lists:

from operator import itemgetter as it
from itertools import repeat
n = 3

new_d = [list(zip(repeat(row["index"]), map(it(0),(row[1:].sort_values(ascending=0)[:n].iteritems()))))
                 for _, row in df.iterrows()]

Output:

[[('A', 'option3'), ('A', 'option2'), ('A', 'option4')],
[('B', 'option3'), ('B', 'option4'), ('B', 'option1')], 
[('C', 'option2'), ('C', 'option5'), ('C', 'option1')], 
[('D', 'option5'), ('D', 'option1'), ('D', 'option2')], 
[('E', 'option1'), ('E', 'option2'), ('E', 'option3')],
[('F', 'option3'), ('F', 'option1'), ('F', 'option2')]]

Or using pythons sorted:

new_d = [list(zip(repeat(row["index"]), map(it(0), sorted(row[1:].iteritems(), key=it(1) ,reverse=1)[:n])))
                     for _, row in df.iterrows()]

Which is actually the fastest, if you really want strings, it is pretty trivial to format the output however you want.

Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
  • This yields the values and not the columns names. – iled Dec 15 '15 at 20:13
  • @lied, the OP may or may not want the names, it is trivial to change if they do, I asked in a comment to clarify – Padraic Cunningham Dec 15 '15 at 20:18
  • thanks Padraic, I have an example of the desired output on the question. Nevertheless, any idea why your code give me this error: AttributeError: 'Series' object has no attribute 'items' n the "pd.DataFrame(map(it(0), sorted(row[1:].items(), key=it(1) ,reverse=1)[:n]) for _, row in df.iterrows())" line? – Diego Dec 17 '15 at 12:20
  • @Diego, you probably need iteritems in python 2, your output does not show if you actually want another dataframe or what exactly, what you accepted answer is doing can be done in a line or two of code – Padraic Cunningham Dec 17 '15 at 12:27
  • hey, thanks for your reply, I'd be very queen to try your approach so I used "iteritems". It then fails on the print(list(row)) with an "KeyError: 'index'". If I remove the "list" and just use the print (row) it doesn't fail but doesn't print anything. Do you think it may be another python 2.7 X 3 difference? thanks again for your time – Diego Dec 17 '15 at 14:07
  • @Diego, what version of pandas are you using? – Padraic Cunningham Dec 17 '15 at 14:10
  • 0.16.2, Also, Im working on Jupiter (not sure if that matters): Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 16:44:52) [MSC v.1500 64 bit (AMD64)] – Diego Dec 17 '15 at 14:29
  • I only have 0.17.0 and 17.1 installed, you would need to change sort_values t order but I have run the code using 2.7 and every example works as posted. – Padraic Cunningham Dec 17 '15 at 14:35
  • Run the code using an explicit loop http://pastebin.com/AjDV0n1m and see what row looks like – Padraic Cunningham Dec 17 '15 at 14:37
  • thank you, it looks a lot better with the explicit loop...not sure whats going on, probably a version thing; Just to abuse your will a lottle longer, can you explain what does the "map(it(0)" is doing in this example? I see that "row[1:].order(ascending=0)[:n].iteritems()" returns an "itertools.izip" object, but not sure what "map" is doing. – Diego Dec 17 '15 at 16:17
  • @Diego, we sort/order by the value i.e the number but we want the string i,e `option3` etc.. `map(itemgetter(0)..` pulls the first element which is each string from the tuples returned from `.iteritems`. Each tuple would be `("option_x",i)` – Padraic Cunningham Dec 17 '15 at 16:24
  • I get "ValueError: ascending must be boolean". Any idea why? – Nivi Mar 19 '18 at 19:38
1
dfc = df.copy()
result = {}

#First, I would effectively transpose this

for key in dfc:
    if key != 'index':
        for i in xrange(0,len(dfc['index'])):
            if dfc['index'][i] not in result:
                result[dfc['index'][i]] = []
            result[dfc['index'][i]] += [(key,dfc[key][i])]


def get_topn(result,n):
    #Use this to get the top value for each option
    return [x[0] for x in sorted(result,key=lambda x:-x[1])[0:min(len(result),n)]]


#Lastly, print the output in your desired format.
n = 3
keys = sorted([k for k in result])
for key in keys:
      for option in get_topn(result[key],n):
         print str(key) + ',' + str(option)
      print
  • thanks Adam, that was really helpful, the only problem was that the order of the ids changed in the end due to the dictionary transformation. I solved that by sorting the "keys" using the original dataframe. Little hacky but thats ok – Diego Dec 17 '15 at 12:17
0

This might not be so elegant, but I think it pretty much gets what you want:

n = 3
df.index = pd.Index(df['index'])
del df['index']
df = df.transpose().unstack()
for i, g in df.groupby(level=0):
    g = g.sort_values(ascending=False)
    print i, list(g.index.get_level_values(1)[:n])
itzy
  • 11,275
  • 15
  • 63
  • 96
0

Yet another crazy one-liner, given n = 3

{index:option for (index, option) in zip(df['index'], 
    [df.columns[pd.notnull(x[1].where(x[1][1:].sort_values()[-n:]))].tolist()
        for x in df.iterrows()])}

{'A': ['option2', 'option3', 'option4'],
 'C': ['option2', 'option4', 'option5'],
 'B': ['option1', 'option3', 'option4'],
 'E': ['option1', 'option2', 'option3'],
 'D': ['option1', 'option2', 'option5'],
 'F': ['option1', 'option3', 'option5']}
iled
  • 2,142
  • 3
  • 31
  • 43