1

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({
    'col1': ['a, b'],
    'col2': [100]
}, index=['A'])

What I'd like to achieve is by "exploding" col1 to create a multi-level index with the values of col1 as the 2nd level - while retaining the value of col2 from the original index, eg:

idx_1,idx_2,val
A,a,100
A,b,100

I'm sure I need a col1.str.split(', ') in there, but I'm at a complete loss as to how to create the desired result - maybe I need a pivot_table but can't see how I can get that to get the required index.

I've spent a good hour and a half looking at the docs on re-shaping and pivoting etc... I'm sure it's straight-forward - I just have no idea of the terminology needed to find the "right thing".

Jon Clements
  • 138,671
  • 33
  • 247
  • 280

1 Answers1

6

Adapting the first answer here, this is one way. You might want to play around with the names to get those that you'd like.

If your eventual aim is to do this for very large dataframes, there may be more efficient ways to do this.

import pandas as pd
from pandas import Series

# Create test dataframe
df = pd.DataFrame({'col1': ['a, b'], 'col2': [100]}, index=['A'])

#split the values in column 1 and then stack them up in a big column
s = df.col1.str.split(', ').apply(Series, 1).stack()

# get rid of the last column from the *index* of this stack 
# (it was all meaningless numbers if you look at it)
s.index = s.index.droplevel(-1)

# just give it a name - I've picked yours from OP
s.name = 'idx_2'

del df['col1']  
df = df.join(s)
# At this point you're more or less there

# If you truly want 'idx_2' as part of the index - do this
indexed_df = df.set_index('idx_2', append=True)

Using your original dataframe as input, the code gives this as output:

>>> indexed_df
         col2
  idx_2
A a       100
  b       100

Further manipulations

If you want to give the indices some meaningful names - you can use

indexed_df.index.names = ['idx_1','idx_2']

Giving output

             col2
idx_1 idx_2
A     a       100
      b       100

If you really want the indices as flattened into columns use this

indexed_df.reset_index(inplace=True)

Giving output

>>> indexed_df
    idx_1 idx_2  col2
0       A     a   100
1       A     b   100
>>>

More complex input

If you try a slightly more interesting example input - e.g.

>>> df = pd.DataFrame({
...     'col1': ['a, b', 'c, d'],
...     'col2': [100,50]
... }, index = ['A','B'])

You get out:

>>> indexed_df
         col2
  idx_2
A a       100
  b       100
B c        50
  d        50
Community
  • 1
  • 1
J Richard Snape
  • 20,116
  • 5
  • 51
  • 79
  • @ixio - your edit was auto-rejected because I was also editing, but if you want to do it again I'll accept because you're absolutely right and you can get the rep for it :) – J Richard Snape May 20 '15 at 16:44