OK, I think I figured something out for you. Here's a toy dataset:
import numpy as np
import pandas
import StringIO
from itertools import product
csvstring = StringIO.StringIO("""\
month,year,val1,flagA1,flagB1,val2,flagA2,flagB2,val3,flagA3,flagB3
12,2011,1,a1,b1,2,a2,b2,3,a3,b3
1,2012,4,a4,b4,5,a5,b5,6,a6,b6
2,2012,7,a7,b7,8,a8,b8,9,a9,b9
""")
df = pandas.read_csv(csvstring)
df['date'] = df.apply(lambda row: datetime.datetime(row['year'], row['month'], 1), axis=1)
df = df.set_index('date').drop(['month', 'year'], axis=1)
print(df.to_string())
val1 flagA1 flagB1 val2 flagA2 flagB2 val3 flagA3 flagB3
date
2011-12-01 1 a1 b1 2 a2 b2 3 a3 b3
2012-01-01 4 a4 b4 5 a5 b5 6 a6 b6
2012-02-01 7 a7 b7 8 a8 b8 9 a9 b9
(I dropped month
and year
and made date
the row index for a reason. You'll see why.)
Next, we need to create a MultiIndex
object that represent the current information stored in the columns. We now that columns come in sets of three: Value
, FlagA
, and FlagB
. So how many sets of columns do we have?
base_cols = ['value', 'FlagA', 'FlagB'] # the basic set we want
num_sets = df.columns.shape[0] / len(base_cols) # number of sets we currently have
The hierarchical index we need to create is the product of the base_cols
and range(1, num_sets+1)
.
newcols = pandas.MultiIndex.from_tuples([p for p in product(range(1, num_sets+1), base_cols)])
which gives us:
MultiIndex
[(1, u'value'), (1, u'FlagA'), (1, u'FlagB'), (2, u'value'), (2, u'FlagA'), (2, u'FlagB'), (3, u'value'), (3, u'FlagA'), (3, u'FlagB')]
Now just assign those new columns to the DataFrame:
df.columns = newcols
df.columns.names = ['set', 'results']
print(df.to_string())
set 1 1 1 2 2 2 3 3 3
results value FlagA FlagB value FlagA FlagB value FlagA FlagB
date
2011-12-01 1 a1 b1 2 a2 b2 3 a3 b3
2012-01-01 4 a4 b4 5 a5 b5 6 a6 b6
2012-02-01 7 a7 b7 8 a8 b8 9 a9 b9
Then if you stack
the dataframe, you get this:
stacked = df.stack(level='set')
print(stacked.to_string())
results FlagA FlagB value
date set
2011-12-01 1 a1 b1 1
2 a2 b2 2
3 a3 b3 3
2012-01-01 1 a4 b4 4
2 a5 b5 5
3 a6 b6 6
2012-02-01 1 a7 b7 7
2 a8 b8 8
3 a9 b9 9
At this point, I think that you're done. You can add the year
column back in if you want (stacked['year'] = stacked.apply(lambda row: row.name[0].year, axis=1)
) but with pandas, you don't really need to do that. The datetime index handling is quite powerful, as seen here: Pandas DataFrame slicing by day/hour/minute