1

I have a dataframe of temperature data, laid out with columns like this:

MONTH YEAR VALUE1 FLAGA1 FLAGB1 ... VALUE31 FLAGA31 FLAGB31

Where VALUEs and FLAGs refer to a day of the month, such that VALUE1 is the value for the first day of the month, VALUE2 for the second, and so on.

Obviously, this is a timeseries with a funky formatting (it's actually useful for working with parts of months, which is why it's like this). I'd like to end up with something like this:

DATE YEAR VALUE FLAGA FLAGB

(Yes, DATE and YEAR are redundant, but useful for filtering).

I don't really know how to get started on this. My natural inclination would be to iterate over the source frame and create a new structure, but I don't even know how to iterate over the values in the source frame. I also suspect that the Pandas / Numpy way is not to iterate individually.

Marcin
  • 48,559
  • 18
  • 128
  • 201
  • can you post some of the data in a copy/pastable form? my general advice is to create a hierarchical index on the columns and then simply use the `stack` method of the dataframe. – Paul H Jan 05 '14 at 16:48

1 Answers1

1

OK, I think I figured something out for you. Here's a toy dataset:

import numpy as np
import pandas
import StringIO
from  itertools import product

csvstring = StringIO.StringIO("""\
month,year,val1,flagA1,flagB1,val2,flagA2,flagB2,val3,flagA3,flagB3
12,2011,1,a1,b1,2,a2,b2,3,a3,b3
1,2012,4,a4,b4,5,a5,b5,6,a6,b6
2,2012,7,a7,b7,8,a8,b8,9,a9,b9
""")

df = pandas.read_csv(csvstring)
df['date'] = df.apply(lambda row: datetime.datetime(row['year'], row['month'], 1), axis=1)
df = df.set_index('date').drop(['month', 'year'], axis=1)
print(df.to_string())

            val1 flagA1 flagB1  val2 flagA2 flagB2  val3 flagA3 flagB3
date                                                                  
2011-12-01     1     a1     b1     2     a2     b2     3     a3     b3
2012-01-01     4     a4     b4     5     a5     b5     6     a6     b6
2012-02-01     7     a7     b7     8     a8     b8     9     a9     b9

(I dropped month and year and made date the row index for a reason. You'll see why.)

Next, we need to create a MultiIndex object that represent the current information stored in the columns. We now that columns come in sets of three: Value, FlagA, and FlagB. So how many sets of columns do we have?

base_cols = ['value', 'FlagA', 'FlagB']  # the basic set we want
num_sets = df.columns.shape[0] / len(base_cols)  # number of sets we currently have

The hierarchical index we need to create is the product of the base_cols and range(1, num_sets+1).

newcols = pandas.MultiIndex.from_tuples([p for p in product(range(1, num_sets+1), base_cols)])

which gives us:

MultiIndex
[(1, u'value'), (1, u'FlagA'), (1, u'FlagB'), (2, u'value'), (2, u'FlagA'), (2, u'FlagB'), (3, u'value'), (3, u'FlagA'), (3, u'FlagB')]

Now just assign those new columns to the DataFrame:

df.columns = newcols
df.columns.names = ['set', 'results']
print(df.to_string())
set             1     1     1      2     2     2      3     3     3
results     value FlagA FlagB  value FlagA FlagB  value FlagA FlagB
date                                                               
2011-12-01      1    a1    b1      2    a2    b2      3    a3    b3
2012-01-01      4    a4    b4      5    a5    b5      6    a6    b6
2012-02-01      7    a7    b7      8    a8    b8      9    a9    b9

Then if you stack the dataframe, you get this:

stacked = df.stack(level='set')
print(stacked.to_string())

results        FlagA FlagB  value
date       set                   
2011-12-01 1      a1    b1      1
           2      a2    b2      2
           3      a3    b3      3
2012-01-01 1      a4    b4      4
           2      a5    b5      5
           3      a6    b6      6
2012-02-01 1      a7    b7      7
           2      a8    b8      8
           3      a9    b9      9

At this point, I think that you're done. You can add the year column back in if you want (stacked['year'] = stacked.apply(lambda row: row.name[0].year, axis=1)) but with pandas, you don't really need to do that. The datetime index handling is quite powerful, as seen here: Pandas DataFrame slicing by day/hour/minute

Community
  • 1
  • 1
Paul H
  • 65,268
  • 20
  • 159
  • 136