Hourly and 10-minute time-zone-aware time series spanning the spring DST change:
import pandas as pd
import numpy as np
ts_hrly = pd.date_range('03-10-2018', '3-13-2018', freq='H', tz='US/Eastern')
ts_10m = pd.date_range('03-10-2018', '3-13-2018', freq='10T', tz='US/Eastern')
Use the hourly data
ts = ts_hrly
df = pd.DataFrame({'tstamp':ts,'period':range(len(ts))})
The DST transition looks like this:
>>> df[18:23]
period tstamp
18 18 2018-03-11 00:00:00-05:00
19 19 2018-03-11 01:00:00-05:00
20 20 2018-03-11 03:00:00-04:00
21 21 2018-03-11 04:00:00-04:00
22 22 2018-03-11 05:00:00-04:00
>>>
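As a quick self-contained sanity check (mine, not part of the original answer), the spring-forward hour really is absent from the series:

```python
import pandas as pd

# On 2018-03-11 the clock jumps from 01:59 to 03:00 in US/Eastern,
# so hour 2 never appears and that day has only 23 hourly stamps.
ts = pd.date_range('03-10-2018', '3-13-2018', freq='H', tz='US/Eastern')
dst_day_hours = ts[(ts.month == 3) & (ts.day == 11)].hour

print(sorted(set(dst_day_hours)))  # 0, 1, 3, 4, ..., 23 -- no 2
```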
To group into twelve-hour increments on 06:00 and 18:00 boundaries, I assigned each observation a shift number, then grouped by that shift number.
My data conveniently starts at a shift change, so calculate the elapsed time since that first shift change:
nanosec = df['tstamp'].values - df['tstamp'].iloc[0].value
Find the shift changes and use np.cumsum() to assign shift numbers:
shift_change = nanosec.astype(np.int64) % (12 * 3600 * 10**9) == 0  # twelve hours in nanoseconds
df['shift_nbr'] = shift_change.cumsum()
gb = df.groupby(df['shift_nbr'])
for k, g in gb:
    print(f'{k} has {len(g)} items')
>>>
1 has 12 items
2 has 12 items
3 has 12 items
4 has 12 items
5 has 12 items
6 has 12 items
I haven't found a way to compensate for data starting in the middle of a shift.
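For what it's worth, one way to compensate — a sketch of my own, not from the answer above — is to anchor the elapsed-time calculation at the 06:00/18:00 boundary preceding the first observation instead of at the first observation itself:

```python
import pandas as pd
import numpy as np

# A series that starts mid-shift: 09:00, three hours into the 06:00 shift.
ts = pd.date_range('03-10-2018 09:00', '3-13-2018', freq='H', tz='US/Eastern')
df = pd.DataFrame({'tstamp': ts, 'period': range(len(ts))})

# Anchor at the previous wall-clock 06:00/18:00 boundary rather than
# at the first observation.
first = df['tstamp'].iloc[0]
hours_into_shift = (first.hour - 6) % 12
anchor = first - pd.Timedelta(hours=hours_into_shift)

nanosec = df['tstamp'].values - anchor.value
shift_change = nanosec.astype(np.int64) % (12 * 3600 * 10**9) == 0
df['shift_nbr'] = shift_change.cumsum()

# The leading partial shift shows up as its own (shorter) group.
print(df.groupby('shift_nbr').size().tolist())  # [9, 12, 12, 12, 12, 6]
```

As with the timezone-aware grouping above, the boundaries are fixed 12-hour absolute intervals, so after the DST change they drift in wall-clock terms.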
If you want the groups for shifts affected by DST changes to have 11 or 13 items, convert the timezone-aware series to a timezone-naive series:
df2 = pd.DataFrame({'tstamp':pd.to_datetime(ts.strftime('%m-%d-%y %H:%M')),'period':range(len(ts))})
Use the same process to assign and group by shift numbers
nanosec = df2['tstamp'].values - df2['tstamp'].iloc[0].value
shift_change = nanosec.astype(np.int64) % (12 * 3600 * 10**9) == 0
df2['shift_nbr'] = shift_change.cumsum()
gb2 = df2.groupby(df2['shift_nbr'])
for k, g in gb2:
    print(f'{k} has {len(g)} items')
>>>
1 has 12 items
2 has 11 items
3 has 12 items
4 has 12 items
5 has 12 items
6 has 12 items
7 has 1 items
Unfortunately, pd.to_datetime(ts.strftime('%m-%d-%y %H:%M')) takes some time. Here is a faster/better way that uses the hour attribute of the timestamps to find the shift boundaries directly — no need to create a separate timezone-naive series, because the hour attribute is the local wall-clock hour. It also works for data starting in the middle of a shift.
ts = pd.date_range('01-01-2018 03:00', '01-01-2019 06:00', freq='H', tz='US/Eastern')
df3 = pd.DataFrame({'tstamp':ts,'period':range(len(ts))})
shift_change = ((df3['tstamp'].dt.hour - 6) % 12) == 0
shift_nbr = shift_change.cumsum()
gb3 = df3.groupby(shift_nbr)
for k, g in gb3:
    if len(g) != 12:
        print(f'shift starting {g["tstamp"].iloc[0]} has {len(g)} items')
>>>
shift starting 2018-01-01 03:00:00-05:00 has 3 items
shift starting 2018-03-10 18:00:00-05:00 has 11 items
shift starting 2018-11-03 18:00:00-04:00 has 13 items
shift starting 2019-01-01 06:00:00-05:00 has 1 items
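A minimal, self-contained recap of the hour-attribute method (variable names here are mine) over the original spring-DST range, confirming the 11-item shift:

```python
import pandas as pd

ts = pd.date_range('03-10-2018', '3-13-2018', freq='H', tz='US/Eastern')
df = pd.DataFrame({'tstamp': ts, 'period': range(len(ts))})

# A shift change happens whenever the wall-clock hour is 06 or 18.
shift_change = (df['tstamp'].dt.hour - 6) % 12 == 0
sizes = df.groupby(shift_change.cumsum()).size()

# Partial first/last shifts and the 11-hour DST shift fall out naturally.
print(sizes.tolist())  # [6, 12, 11, 12, 12, 12, 7]
```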