2

I'm having problems with MultiIndex and stack(). The following example is based on a solution from Calvin Cheung on StackOvervlow.

=== multi.csv ===
  h1,main,h3,sub,h5
  a,A,1,A1,1
  b,B,2,B1,2
  c,B,3,A1,3
  d,A,4,B2,4
  e,A,5,B3,5
  f,B,6,A2,6

=== multi.py ===
  #!/usr/bin/env python

  import pandas as pd

  df1 = pd.read_csv('multi.csv')
  df2 = df1.pivot('main', 'sub').stack()
  print(df2)

=== output ===
           h1  h3  h5
  main sub
  A    A1   a   1   1
       B2   d   4   4
       B3   e   5   5
  B    A1   c   3   3
       A2   f   6   6
       B1   b   2   2

This works as long as the entries in the sub column are unique with respect to the corresponding entry in the main column. But if we change the sub column entry in row e to B2, then B2 is no longer unique in the group of A rows and we get an error message: "pandas.core.reshape.ReshapeError: Index contains duplicate entries, cannot reshape".

I was expected the shape of the sub index to behave like the shape of the primary index where duplicates are indicated with blank entries under the first row entry.

=== expected output ===
           h1  h3  h5
  main sub
  A    A1   a   1   1
       B2   d   4   4
            e   5   5
  B    A1   c   3   3
       A2   f   6   6
       B1   b   2   2

So my question is, how can I structure a MultiIndex in a way that allows duplicates in sub-levels?

Community
  • 1
  • 1
highpost
  • 1,263
  • 2
  • 14
  • 25
  • Indexes (including MultiIndexes) are supposed to be unique. Think of it as like an index whole elements are tuples. It's okay if one level is nonunique, but the entire set of MultiIndex elements for a particular row has to be unique to that row. I think you need to rethink your structure. Why do you want these values to be indexes instead of just column values? – BrenBarn Jul 06 '13 at 01:13
  • Okay, I'll accept that I need to rethink MultiIndexes. I was hoping for something that was both intuitively readable and easy to write slices. But I'll look at this again and also see if I can improve the documentation. – highpost Jul 06 '13 at 03:42

1 Answers1

1

Rather than do a pivot*, just set_index directly (this works for both examples):

In [11]: df
Out[11]:
  h1 main  h3 sub  h5
0  a    A   1  A1   1
1  b    B   2  B1   2
2  c    B   3  A1   3
3  d    A   4  B2   4
4  e    A   5  B2   5
5  f    B   6  A2   6

In [12]: df.set_index(['main', 'sub'])
Out[12]:
         h1  h3  h5
main sub
A    A1   a   1   1
B    B1   b   2   2
     A1   c   3   3
A    B2   d   4   4
     B2   e   5   5
B    A2   f   6   6

*You're not really doing a pivot here anyway, it just happens to work in the above case.

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • You can do this, but as described [here](http://pandas.pydata.org/pandas-docs/dev/dsintro.html#series) it may raise an exception later if you try to do something that requires a unique index. Also as described in [this question](http://stackoverflow.com/questions/16626058/do-non-unique-indexes-provide-any-performance-advantage-in-pandas) you may suffer a performance hit. – BrenBarn Jul 06 '13 at 07:58