0

I'm trying to code as efficiently and pythonicly as possible (that is avoiding loops whenever possible) and I've found this and this but they seem to be irrelevant to my question.

Consider a pd.DataFrame called df with 30 years of hourly data for 28 countries:

  • index: datetime64[ns] (e.g. '1980-01-01 00:00:00') and in
  • columns: a 2-digit countrycode of each of the 28 countries.

e.g.

countrycode                AT        BE        [...]
time                                   
1980-01-01 00:00:00  0.016947  0.059076
1980-01-01 01:00:00  0.019461  0.068575
[...]
2009-31-12 23:00:00  0.119735  0.287495

In order to create yearly boxplots for every country separately, e.g. via ds[country].plot(kind='box'), I intend to transform this DataFrame into the following form:

  • columns: pd.MultiIndex such as (countrycode, year) e.g. ('AT', 1980), ('AT', 1981), ...
  • index: range(8760) e.g. (0, 1, ... 8759)

The piece of code I came up with works well, but its damn slow due to a nested loop:

years = sorted(set(df.index.year))
ds = pd.DataFrame(columns=pd.MultiIndex
                            .from_product([df.columns, years]))
for reg, col in df.iteritems():
    for year, group in col.groupby(by=df.index.year):
        ds.loc[:, (reg, year)] = group.reset_index(drop=True)

Any ideas to code this part more efficiently are highly welcome :)

1 Answers1

0
df.groupby(['countrycode','year']).size().unstack().plot(kind='box')
learner
  • 119
  • 1
  • 2
  • 12
  • Thanks, but this does not work, as neither 'year' (as part of `datetime64[ns]`-index) nor 'countrycode' (as name of the columns) are valid keys. – stack4science Mar 23 '18 at 12:03