I'm trying to code as efficiently and pythonicly as possible (that is avoiding loops whenever possible) and I've found this and this but they seem to be irrelevant to my question.
Consider a pd.DataFrame
called df
with 30 years of hourly data for 28 countries:
- index:
datetime64[ns]
(e.g.'1980-01-01 00:00:00'
) and in - columns: a 2-digit countrycode of each of the 28 countries.
e.g.
countrycode AT BE [...]
time
1980-01-01 00:00:00 0.016947 0.059076
1980-01-01 01:00:00 0.019461 0.068575
[...]
2009-31-12 23:00:00 0.119735 0.287495
In order to create yearly boxplots for every country separately, e.g. via ds[country].plot(kind='box')
, I intend to transform this DataFrame into the following form:
- columns:
pd.MultiIndex
such as(countrycode, year)
e.g. ('AT', 1980), ('AT', 1981), ... - index:
range(8760)
e.g. (0, 1, ... 8759)
The piece of code I came up with works well, but its damn slow due to a nested loop:
years = sorted(set(df.index.year))
ds = pd.DataFrame(columns=pd.MultiIndex
.from_product([df.columns, years]))
for reg, col in df.iteritems():
for year, group in col.groupby(by=df.index.year):
ds.loc[:, (reg, year)] = group.reset_index(drop=True)
Any ideas to code this part more efficiently are highly welcome :)