66

I would like to run a pivot on a pandas DataFrame, with the index being two columns, not one. For example, one field for the year, one for the month, an 'item' field which shows 'item 1' and 'item 2' and a 'value' field with numerical values. I want the index to be year + month.

The only way I managed to get this to work was to combine the two fields into one, then separate them again. is there a better way?

Minimal code copied below. Thanks a lot!

PS Yes, I am aware there are other questions with the keywords 'pivot' and 'multi-index', but I did not understand if/how they can help me with this question.

import pandas as pd
import numpy as np

df= pd.DataFrame()
month = np.arange(1, 13)
values1 = np.random.randint(0, 100, 12)
values2 = np.random.randint(200, 300, 12)


df['month'] = np.hstack((month, month))
df['year'] = 2004
df['value'] = np.hstack((values1, values2))
df['item'] = np.hstack((np.repeat('item 1', 12), np.repeat('item 2', 12)))

# This doesn't work: 
# ValueError: Wrong number of items passed 24, placement implies 2
# mypiv = df.pivot(['year', 'month'], 'item', 'value')

# This doesn't work, either:
# df.set_index(['year', 'month'], inplace=True)
# ValueError: cannot label index with a null key
# mypiv = df.pivot(columns='item', values='value')

# This below works but is not ideal: 
# I have to first concatenate then separate the fields I need
df['new field'] = df['year'] * 100 + df['month']

mypiv = df.pivot('new field', 'item', 'value').reset_index()
mypiv['year'] = mypiv['new field'].apply( lambda x: int(x) / 100)  
mypiv['month'] = mypiv['new field'] % 100
Max Ghenis
  • 14,783
  • 16
  • 84
  • 132
Pythonista anonymous
  • 8,140
  • 20
  • 70
  • 112
  • 3
    I've provided several detailed examples and alternative approaches in this [**Q&A**](https://stackoverflow.com/q/47152691/2336654) – piRSquared Nov 11 '17 at 22:19

4 Answers4

113

You can group and then unstack.

>>> df.groupby(['year', 'month', 'item'])['value'].sum().unstack('item')
item        item 1  item 2
year month                
2004 1          33     250
     2          44     224
     3          41     268
     4          29     232
     5          57     252
     6          61     255
     7          28     254
     8          15     229
     9          29     258
     10         49     207
     11         36     254
     12         23     209

Or use pivot_table:

>>> df.pivot_table(
        values='value', 
        index=['year', 'month'], 
        columns='item', 
        aggfunc=np.sum)
item        item 1  item 2
year month                
2004 1          33     250
     2          44     224
     3          41     268
     4          29     232
     5          57     252
     6          61     255
     7          28     254
     8          15     229
     9          29     258
     10         49     207
     11         36     254
     12         23     209
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • 3
    @Alexander, `pivot_table()` requires _aggfunc_ parameter and if no such parameter is provided then `mean()` function is used by default. If `sum()` capability is required then `pivot_table()` function should have `aggfunc=sum` added to the call. **Source:** [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table) _aggfunc : function, list of functions, dict, default numpy.mean_ – rp1 Sep 07 '18 at 02:23
  • @rp1 Good point. It was identical because the original sample dataframe only had one value per item, year and month. Amending response to reflect correction. – Alexander Sep 07 '18 at 03:41
34

I believe if you include item in your MultiIndex, then you can just unstack:

df.set_index(['year', 'month', 'item']).unstack(level=-1)

This yields:

                value      
item       item 1 item 2
year month              
2004 1         21    277
     2         43    244
     3         12    262
     4         80    201
     5         22    287
     6         52    284
     7         90    249
     8         14    229
     9         52    205
     10        76    207
     11        88    259
     12        90    200

It's a bit faster than using pivot_table, and about the same speed or slightly slower than using groupby.

Ajean
  • 5,528
  • 14
  • 46
  • 69
  • 1
    You can also reference the index level directly, e.g. df.set_index(['year', 'month', 'item']).unstack('fcode'). Alexander's first method in his answer would also fail with non-numerical data (in a more generalized problem than the one here). – Carl Aug 23 '17 at 14:18
  • this is an amazing answer. – Jason Wolosonovich Dec 21 '17 at 21:08
  • I like this answer, but I'm having some trouble with its implementation. I have multiple columns of data, that I want to flatten to a single row. This answer gets me most of where I need to go, but I'm getting a ton of NaN's in my data. Any ideas? – Adrian Keister Aug 13 '19 at 22:25
  • Never mind, I figured it out. I went with code like this: `df = pd.DataFrame(df.set_index(['PRS 3']).unstack(-1))` and then `df = df.transpose()`. – Adrian Keister Aug 13 '19 at 22:35
4

The following worked for me:

mypiv = df.pivot(index=['year','month'],columns='item')[['values1','values2']]
buddemat
  • 4,552
  • 14
  • 29
  • 49
0

thanks to gmoutso comment you can use this:

def multiindex_pivot(df, index=None, columns=None, values=None):
    if index is None:
        names = list(df.index.names)
        df = df.reset_index()
    else:
        names = index
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index] # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    return df

usage:

df.pipe(multiindex_pivot, index=['idx_column1', 'idx_column2'], columns='foo', values='bar')

You might want to have a simple flat column structure and have columns to be of their intended type, simply add this:

(df
   .infer_objects()  # coerce to the intended column type
   .rename_axis(None, axis=1))  # flatten column headers
moshevi
  • 4,999
  • 5
  • 33
  • 50