5

I want to aggregate a pandas.Series with an hourly DatetimeIndex to monthly values - while considering the offset to midnight.

Example

Consider the following (uniform) timeseries that spans about 1.5 months.

import pandas as pd
hours = pd.Series(1, pd.date_range('2020-02-23 06:00', freq = 'H', periods=1008))
hours
# 2020-02-23 06:00:00    1
# 2020-02-23 07:00:00    1
#                       ..
# 2020-04-05 04:00:00    1
# 2020-04-05 05:00:00    1
# Freq: H, Length: 1000, dtype: int64

I would like to sum these to months while considering, that days start at 06:00 in this use-case. The result should be:

2020-02-01 06:00:00    168
2020-03-01 06:00:00    744
2020-04-01 06:00:00     96
freq: MS, dtype: int64

How do I do that??


What I've tried and what works

  • I can aggregate to days while considering the offset, using the offset parameter:

    days = hours.resample('D', offset=pd.Timedelta('06:00:00')).sum()
    days
    # 2020-02-23 06:00:00    24
    # 2020-02-24 06:00:00    24
    # ..
    # 2020-04-03 06:00:00    24
    # 2020-04-04 06:00:00    24  
    # Freq: D, dtype: int64
    
  • Using the same method to aggregate to months does not work. The timestamps do not have a time component, and the values are incorrect:

    months = hours.resample('MS', offset=pd.Timedelta('06:00:00')).sum()
    months
    # 2020-02-01    162 # wrong
    # 2020-03-01    744
    # 2020-04-01    102 # wrong
    # Freq: MS, dtype: int64
    
  • I could do the aggregation to months as a second step after aggregating to days. In that case, the values are correct, but the time component is still missing from the timestamps:

    days = hours.resample('D', offset=pd.Timedelta('06:00:00')).sum()
    months = days.resample('MS', offset=pd.Timedelta('06:00:00')).sum()
    months
    # 2020-02-01    168
    # 2020-03-01    744
    # 2020-04-01     96
    # Freq: MS, dtype: int64
    
  • My current workaround is adding the timedelta and resetting the frequency manually.

    months.index += pd.Timedelta('06:00:00')
    months.index.freq = 'MS' 
    months
    # 2020-02-01 06:00:00    168
    # 2020-03-01 06:00:00    744
    # 2020-04-01 06:00:00     96
    # freq: MS, dtype: int64
    
ElRudi
  • 2,122
  • 2
  • 18
  • 33
  • 1
    If there is something I can do to clarify, e.g. the question or the use case, please let me know too – ElRudi Nov 21 '22 at 16:30
  • Does your current workaround(last use case with code and output) serve your purpose? – Mostafa Shamin Yeasar Dec 01 '22 at 06:21
  • Yes, it serves the purpose in the narrow use-case that I'm currently dealing with – ElRudi Dec 02 '22 at 14:15
  • The reason why the values are off is because the first day is being captured in the next month. This is because, for instance, `2020-02-01 06:00:00` has not yet _started_ at the beginning of the day, since the day starts at 00:00, and it _carries over into the next day_ which is why the last month has 1 extra day. – Mike Williamson Dec 06 '22 at 17:14

1 Answers1

1

Not too much of an improvement on your attempt, but you could write the resampling as

months = hours.resample('D', offset='06:00:00').sum().resample('MS').sum()

changing the index labels still requires the hack you've been doing, as in adding the time delta manually and setting freq to MS

note that you can pass a string representation of the time delta to offset.

The reason two resampling operations are needed is because when the resampling frequency is greater than 'D', the offset is ignored. Once your resample at the daily level is performed with the offset, the result can be further resampled without specifying the offset.

I believe this is buggy behaviour, and I agree with you that hours.resample('MS', offset='06:00:00').sum() should produce the expected result.

Essentially, there are two issues:

  1. the binning is incorrect when there is an offset applied & the frequency is greater than 'D'. The offset is ignored.
  2. the offset is not reflected in the final output, the output truncates to the start or end of the period. I'm not sure if the behaviour you're expecting can be generalized for all users.

That there is a related bug issue impacting resampling with offsets. I have not determined yet whether that and the issue you face have the same root cause. Its the same root cause.

Haleemur Ali
  • 26,718
  • 5
  • 61
  • 85
  • 1
    Thanks for your answer @haleemur-ali, even if, as you mentioned, it's not much of an improvement. I'm aware of that bug report; I filed it myself :):) I mainly did that because I was not getting answers here, and because am also convinced this is not expected/wanted behaviour. – ElRudi Nov 30 '22 at 16:38
  • since posting here, i looked at the code and know where to apply the fix. i have a test branch working, but we should get community approval on what the expected behaviour should be w.r.t. displaying the offsets. I suggest we keep having this discussion on github & once my fix is merged in, update the answer on SO. – Haleemur Ali Nov 30 '22 at 17:42