pandas groupby half hourly timeseries data

Question

I have a timeseries of half hourly electricity data that looks like this:

                Date_Time  Metered Electricity (MWh)
0     2016-03-27 00:00:00                   8.644511
1     2016-03-27 00:30:00                   6.808402
2     2016-03-27 01:00:00                   6.507068
3     2016-03-27 01:30:00                   5.271631
4     2016-03-27 02:00:00                   2.313497
...                   ...                        ...
58122 2019-06-30 11:30:00                   8.051935
58123 2019-06-30 12:00:00                   3.520226
58124 2019-06-30 12:30:00                   5.093964

I want to average all of the data points into an average for each half hourly timestep, ultimately so I can create a graph showing the average electricity produced throughout the day.

I've managed to do this for the hourly data using groupby which works fine:

mean_hourly = energy_2018.groupby(energy_2018["Date_Time"].dt.hour).mean()

which I can use if I can't work out how to do groupby half hourly, but it would mean I am missing out on half of all the data. Any idea how to use groupby half hourly so I can use all of the data?

Thank you!

infinite789 · Accepted Answer · 2021-12-02T03:14:44.253

You can make groups and group by hour and minute. Since you have only recorded data in intervals of half an hour you get one distinct group per hour and per 30 minutes.

import pandas as pd
df = pd.DataFrame({
    'time': ['2016-03-27 00:00:00',
            '2016-03-27 00:00:00',                  
            '2016-03-27 00:30:00',                  
            '2016-03-27 01:00:00',                   
            '2016-03-27 01:30:00',                 
            '2019-06-30 11:30:00',                   
            '2019-06-30 12:00:00',                 
            '2019-06-30 12:30:00'],
    'electricity': [8.644511,
                    6.808402,
                    6.507068,
                    5.271631,
                    2.313497,
                    8.051935,
                    3.520226,
                    5.093964]
})

df['time'] = pd.to_datetime(df['time'])
df['minutes'] = df['time'].apply(lambda x: x.minute)
df['hour'] = df['time'].apply(lambda x: x.hour)
df.groupby(['hour', 'minutes']).mean()

Output:

EDIT:

As indicated in Quang Hoang's comment, a better way to get the hours and the minutes would be

df['minutes'] = df['time'].dt.minute
df['hour'] = df['time'].dt.hour

It is better to use existing solutions from the standard library, to increase readability and performance. On the other hand lambda expressions do provide some flexibility and can be quite useful at times. If you are interested you can read more here:

Why are Python lambdas useful?

https://www.code-learner.com/advantages-and-disadvantages-of-lambda-expressions-in-python-and-their-usage-scenarios/

`df['time'].dt.hour` and dt['time'].dt.minute` would be a better way to get the hours and the minutes. — Quang Hoang, Dec 02 '21 at 02:47
Thanks @DataFace, I've never really seen a lambda function in action and actually understood what was going on but now I have! But thank you both regardless, this achieved exactly what I wanted it to. — Luke Hickinbotham, Dec 02 '21 at 03:49
There seem to be 3 answers that all work with the input data provided from OP. For data that has breaks between times, if you need those 'in-between' gaps then @Wilian resample method is best; if you don't need flexibility and potentially modifying your groupby columns (i.e. hour, minute is fine but what if groupby hour, minute, second, then you have to create additional columns to get to the extra detail(s) or time resolution required so this is less flexible) but it works then Wilian's solution works. If you want wide flexibility of x-units of y-timeperiods then pd.Grouper is most flexible. — Chris, Dec 02 '21 at 05:20
After my post, I learned that lambda expressions are quite controversial, but nevertheless I would like to point out that you can do `df['time'].apply(lambda x: 0 if x.minute < 30 else 1)` in case your data was not recorded on fixed intervals to get 2 half hourly groups. Of course I agree that the resample method is much more general and clear. — infinite789, Dec 02 '21 at 22:29

score 2 · Answer 2 · answered Dec 02 '21 at 02:37

df = pd.read_excel('test.xlsx')
print(df)

output:

            Date_Time  Metered Electricity (MWh)
0 2016-03-27 00:00:00                          1
1 2016-03-27 00:29:00                          2
2 2016-03-27 00:59:00                          3
3 2016-03-27 00:57:00                          4
4 2016-03-27 02:00:00                          5

Then do this:

df.set_index('Date_Time',inplace=True)
df = df.resample("30T").mean().reset_index()
print(df)

output:

            Date_Time  Metered Electricity (MWh)
0 2016-03-27 00:00:00                        1.5
1 2016-03-27 00:30:00                        3.5
2 2016-03-27 01:00:00                        NaN
3 2016-03-27 01:30:00                        NaN
4 2016-03-27 02:00:00                        5.0

EDIT

or just this:

df = df.set_index('Date_Time').resample("30T").mean().reset_index()

Ah I see what you were doing, but my data was already half-hourly so I didn't need it resampled. I just wanted the mean of each half hourly time throughout the entire dataset. — Luke Hickinbotham, Dec 02 '21 at 03:52
df['mean 30 minutes'] = df.groupby(pd.Grouper(key='Date_Time', freq='30min')).transform('mean') — Wilian, Dec 02 '21 at 11:39

score 0 · Answer 3 · answered Dec 02 '21 at 02:41

0

energy_2018.groupby(energy_2018["Date_Time"].astype('int64') / 1e9 % 86400 / 3600).mean()

answered Dec 02 '21 at 02:41

Ming

479
2
11

Chris · Answer 4 · 2021-12-02T03:05:43.457

To produce an aggregated value that is the same for all rows within that 'grouped aggregated time', use a combination of pd.groupby and pd.Grouper. The trick to return the aggregated value to every row instead of a reshaped dataframe is to use the transform function afterward (I used numpy's mean (np.mean) so be sure to import numpy as np in the code). If you need more specific flexibility (something like every 10 minutes but start the first minute at some offset minute instead of zero (e.g. 00:02:00 rather than 00:00:00) then additional parameters are needed.

import numpy as np

energy_2018["mean_hourly"] = energy_2018.groupby(pd.Grouper(key="Date_Time", freq="60Min")).transform(np.mean)

Date_Time               Metered Electricity (MWh)  mean_hourly
2016-03-27 00:03:00                   8.644511     7.726456
2016-03-27 00:31:00                   6.808402     7.726456
2016-03-27 01:00:00                   6.507068     5.889349
2016-03-27 01:30:00                   5.271631     5.889349
2016-03-27 02:00:00                   2.313497     2.313497

pandas groupby half hourly timeseries data

4 Answers4