This is similar to MaxU's answer, but shows a way to groupby both the date columns and the non-date columns (RegionName
, State
).
import numpy as np
import pandas as pd
df = pd.DataFrame({'2000-01': [53100], '2000-02': [53200], '2000-03': [53400], 'RegionName': ['Philadelphia'], 'State': ['PA']})
melted = pd.melt(df, id_vars=['RegionName', 'State'], var_name='date')
melted['date'] = pd.PeriodIndex(melted['date'], freq='Q')
result = melted.groupby(['RegionName', 'State', 'date']).mean()
result = result['value'].unstack('date')
yields
date 2000Q1
RegionName State
Philadelphia PA 53233
First, use pd.melt
to coalesce all the date columns into a single column:
import numpy as np
import pandas as pd
df = pd.DataFrame({'2000-01': [53100], '2000-02': [53200], '2000-03': [53400], 'RegionName': ['Philadelphia'], 'State': ['PA']})
melted = pd.melt(df, id_vars=['RegionName', 'State'], var_name='date')
# RegionName State date value
# 0 Philadelphia PA 2000-01 53100
# 1 Philadelphia PA 2000-02 53200
# 2 Philadelphia PA 2000-03 53400
Next, use pd.PeriodIndex
to convert the date (strings?) to pd.Period
s. Notice that pd.PeriodIndex
can convert different date strings to the same Period:
melted['date'] = pd.PeriodIndex(melted['date'], freq='Q')
# RegionName State date value
# 0 Philadelphia PA 2000Q1 53100
# 1 Philadelphia PA 2000Q1 53200
# 2 Philadelphia PA 2000Q1 53400
Finally, use groupby/mean
to group together rows with the same RegionName
, State
and date
and compute the mean for each group:
result = melted.groupby(['RegionName', 'State', 'date']).mean()
# value
# RegionName State date
# Philadelphia PA 2000Q1 53233
If you stop here, your DataFrame will be tidy (PDF) -- each row represents one "observation". This is often the best form for further computation. (It isn't by accident that first step in this solution was to convert the original DataFrame to tidy format -- notice that melted
also has the property that each row represents one observation.)
However, if you'd like, you could move the date
index level into separate columns:
result = result['value'].unstack('date')