24

I have a multi-index DataFrame created via a groupby operation. I'm trying to do a compound sort using several levels of the index, but I can't seem to find a sort function that does what I need.

Initial dataset looks something like this (daily sales counts of various products):

         Date Manufacturer Product Name Product Launch Date  Sales
0  2013-01-01        Apple         iPod          2001-10-23     12
1  2013-01-01        Apple         iPad          2010-04-03     13
2  2013-01-01      Samsung       Galaxy          2009-04-27     14
3  2013-01-01      Samsung   Galaxy Tab          2010-09-02     15
4  2013-01-02        Apple         iPod          2001-10-23     22
5  2013-01-02        Apple         iPad          2010-04-03     17
6  2013-01-02      Samsung       Galaxy          2009-04-27     10
7  2013-01-02      Samsung   Galaxy Tab          2010-09-02      7

I use groupby to get a sum over the date range:

> grouped = df.groupby(['Manufacturer', 'Product Name', 'Product Launch Date']).sum()
                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPad         2010-04-03              30
             iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

So far so good!

Now the last thing I want to do is sort each manufacturer's products by launch date, but keep them grouped hierarchically under Manufacturer - here's all I am trying to do:

                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

When I try sortlevel() I lose the nice per-company hierarchy I had before:

> grouped.sortlevel('Product Launch Date')
                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
Apple        iPad         2010-04-03              30
Samsung      Galaxy Tab   2010-09-02              22

sort() and sort_index() just fail:

grouped.sort(['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'

grouped.sort_index(by=['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'

Seems like a simple operation, but I can't quite figure it out.

I'm not tied to using a MultiIndex for this, but since that's what groupby() returns, that's what I've been working with.

BTW the code to produce the initial DataFrame is:

data = {
  'Date': ['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01', '2013-01-02', '2013-01-02', '2013-01-02', '2013-01-02'],
  'Manufacturer' : ['Apple', 'Apple', 'Samsung', 'Samsung', 'Apple', 'Apple', 'Samsung', 'Samsung',],
  'Product Name' : ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab', 'iPod', 'iPad', 'Galaxy', 'Galaxy Tab'], 
  'Product Launch Date' : ['2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02','2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02'],
  'Sales' : [12, 13, 14, 15, 22, 17, 10, 7]
}
df = DataFrame(data, columns=['Date', 'Manufacturer', 'Product Name', 'Product Launch Date', 'Sales'])
Keeth
  • 2,100
  • 2
  • 21
  • 29
  • 3
    "Data will be lexicographically sorted by the chosen level *followed by the other levels (in order)*" (that sucks...) – Andy Hayden Jun 21 '13 at 19:53

5 Answers5

11

A hack would be to change the order of the levels:

In [11]: g
Out[11]:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPad         2010-04-03              30
             iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

In [12]: g.index = g.index.swaplevel(1, 2)

Sortlevel, which (as you've found) sorts the MultiIndex levels in order:

In [13]: g = g.sortlevel()

And swap back:

In [14]: g.index = g.index.swaplevel(1, 2)

In [15]: g
Out[15]:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

I'm of the opinion that sortlevel should not sort the remaining labels in order, so will create a github issue. :) Although it's worth mentioning the docnote about "the need for sortedness".

Note: you could avoid the first swaplevel by reordering the order of the initial groupby:

g = df.groupby(['Manufacturer', 'Product Launch Date', 'Product Name']).sum()
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • 1
    [This doc note](http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-need-for-sortedness) suggests that the levels need to be sorted, although apparently this is just an implementation detail. It's not clear whether this means they must be sorted hierarchically from highest to lowest index level. – BrenBarn Jun 21 '13 at 20:03
  • @BrenBarn It's a good point, I've heard Jeff go on about that before... :) – Andy Hayden Jun 21 '13 at 20:05
  • Incidentally, can't you eliminate the extra swapping/sorting in your solution by doing the initial groupby in the swapped order (then just swaplevel after the group)? – BrenBarn Jun 21 '13 at 20:06
  • @BrenBarn Thanks (will mention this)! :) – Andy Hayden Jun 21 '13 at 20:08
7

This one liner works for me:

In [1]: grouped.sortlevel(["Manufacturer","Product Launch Date"], sort_remaining=False)

                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

Note this works too:

groups.sortlevel([0,2], sort_remaining=False)

This wouldn't have worked when you originally posted over two years ago, because sortlevel by default sorted on ALL indices which mucked up your company hierarchy. sort_remaining which disables that behavior was added last year. Here's the commit link for reference: https://github.com/pydata/pandas/commit/3ad64b11e8e4bef47e3767f1d31cc26e39593277

Jim
  • 568
  • 1
  • 7
  • 14
  • 1
    Thanks for posting an updated answer. I had a three level multi-index and only wanted to sort by the first two. This worked perfectly. – Arjun Kumar Dec 01 '15 at 00:59
6

To sort a MultiIndex by the "index columns" (aka. levels) you need to use the .sort_index() method and set its level argument. If you want to sort by multiple levels, the argument needs to be set to a list of level names in sequential order.

This should give you the DataFrame you need:

df.groupby(['Manufacturer',
            'Product Name', 
            'Launch Date']
          ).sum().sort_index(level=['Manufacturer','Launch Date'])
fpersyn
  • 1,045
  • 1
  • 12
  • 19
  • 2
    You can also set the `ascending` argument with a list of boolean values to control the directions for each level seperately. e.g. `.sort_index(level=['Manufacturer','LaunchDate'], ascending=[True,False])`. – fpersyn May 31 '19 at 15:27
0

If you want try to avoid multiple swaps within a very deep MultiIndex, you also could try with this:

  1. Slicing by level X (by list comprehension + .loc + IndexSlice)
  2. Sort the desired level (sortlevel(2))
  3. Concatenate every group of level X indexes

Here you have the code:

import pandas as pd
idx = pd.IndexSlice
g = pd.concat([grouped.loc[idx[i,:,:],:].sortlevel(2) for i in grouped.index.levels[0]])
g
Xavi
  • 179
  • 1
  • 4
0

If you are not concerned about conserving the index (I often prefer an arbitrary integer index) you can just use the following one-liner:

grouped.reset_index().sort(["Manufacturer","Product Launch Date"])