I want to resample() my daily data into six-month chunks. However, I want the ends of the six-month chunks to be the ends of April and October. If I use df.resample('6M').sum() (or df.groupby(pd.Grouper(freq='6M')).sum()), the end of the first six-month chunk is the end of the first month in the data. I know about anchored offsets, but I do not know how to create a custom anchored offset (e.g., '6M-APR' does not work).

Here is some example code:

import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(
    data={'logret': np.random.randn(1000)},
    index=pd.date_range(start='2001-05-25', periods=1000, freq='B')
)
df.resample('6M').sum()

Which yields the following output:

    logret
2001-05-31  2.2950148716254297
2001-11-30  -12.536360930670858
2002-05-31  5.468848462868161
2002-11-30  13.027927629740189
2003-05-31  -10.37282118563155
2003-11-30  -0.156275418330286
2004-05-31  -3.0768727498370905
2004-11-30  28.328856464071546
2005-05-31  -3.6462613215100546

I have not achieved my goal (six-month resampling that ends in April and October) with the origin, offset, and loffset arguments to .resample().

I have achieved my goal with the hack below. However, it loses the date index, and I would like a more robust/repeatable approach.

def sixmonth(d, b=4):
    y, m, h = d.year, d.month, 1
    if (m > (b + 6)): y += 1
    elif (m > b): h += 1

    return y + h/10

df.groupby(sixmonth).sum()

Which yields the following output without a date:

    logret
2001.2  -10.300839024148
2002.1  9.321994034984547
2002.2  8.855517878860585
2003.1  -2.4576797445001493
2003.2  -7.002919570231796
2004.1  -9.36895555474087
2004.2  27.13038641177464
2005.1  3.154551390326532

Of course, I could improve this hack. But is there a better/robust/repeatable solution for n-period resampling that ends in arbitrary months?

Richard Herron

3 Answers

Another workaround, keeping the datetime index:

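def custom_6M(df, month=4):
    df = df.resample("M").sum()   # monthly totals, labeled at month end
    df = df.rolling(6).sum()      # trailing six-month totals (NaN until six months exist)
    other = (month + 5) % 12 + 1  # the month six months away, wrapped into 1..12
    return df[df.index.month.isin([month, other])]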

>>> custom_6M(df)
               logret
2001-10-31 -10.300839
2002-04-30   9.321994
2002-10-31   8.855518
2003-04-30  -2.457680
2003-10-31  -7.002920
2004-04-30  -9.368956
2004-10-31  27.130386
not_speshal

It's a pain. When I needed something similar, I ended up with the following approach:

anchor_month = 4
# quarter-end months that fall mid-period, wrapped into 1..12 (7 and 1 for April)
non_months = (anchor_month + 2) % 12 + 1, (anchor_month + 8) % 12 + 1
df = df.resample('Q-APR').sum()
df = (df.reset_index()
        .groupby(df.index.month.isin(non_months).cumsum())
        .agg({'index': 'last', 'logret': 'sum'})
        .set_index('index'))

Result here:

               logret
index                
2001-10-31 -10.300839
2002-04-30   9.321994
2002-10-31   8.855518
2003-04-30  -2.457680
2003-10-31  -7.002920
2004-04-30  -9.368956
2004-10-31  27.130386
2005-04-30   3.154551

The problem is that the last index label sometimes doesn't fit: when the data ends mid-period, the last label falls on one of the non_months (it is fine here). That can be fixed by another '6M' resample. Overall: not pretty.
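
As a rough illustration of the case where the last label does not fit, the sketch below extends the example data to end on 2005-06-30 (an assumption made only for this demonstration), so the final quarter label is 2005-07-31. A second six-month resample then moves it into the bin ending 2005-10-31. The names daily, quarterly, halves, and fixed are hypothetical, and offset objects are passed instead of the 'Q-APR' and '6M' strings because recent pandas versions renamed those aliases.

```python
import numpy as np
import pandas as pd

anchor_month = 4
# Quarter-end months that fall mid-period (7 and 1 for an April anchor).
non_months = [(anchor_month + 2) % 12 + 1, (anchor_month + 8) % 12 + 1]

np.random.seed(42)
daily = pd.DataFrame(
    {'logret': np.random.randn(1070)},
    index=pd.date_range('2001-05-25', periods=1070, freq='B'),  # ends 2005-06-30
)

# Quarterly sums with quarter ends in Jan, Apr, Jul, Oct (same bins as 'Q-APR').
quarterly = daily.resample(pd.offsets.QuarterEnd(startingMonth=anchor_month)).sum()
quarterly = quarterly.rename_axis('index')

# Pair each mid-period quarter with the quarter that follows it.
halves = (quarterly.reset_index()
          .groupby(quarterly.index.month.isin(non_months).cumsum())
          .agg({'index': 'last', 'logret': 'sum'})
          .set_index('index'))

# The last group holds a lone July quarter, so its label does not fit.
# Resampling again in six-month steps relabels it to the next October.
fixed = halves.resample(pd.offsets.MonthEnd(6)).sum()
```

The second resample relies on the first label already being an anchor month end, which holds here because only the final group can consist of a single mid-period quarter.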

Timus

Thanks for the answers. I have two more options.

Append a time-stamped series to df to anchor the six-month resampling periods

I hoped that .resample()'s origin argument would let me manually anchor my six-month resampling periods. It doesn't, but the following code does.

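anchor = pd.DataFrame({'logret': np.nan}, index=[pd.Timestamp('2001-04-30')])  # empty row that pins the first bin edge
pd.concat([anchor, df]).resample('6M').sum()  # pd.concat() replaces DataFrame.append(), which pandas 2.0 removed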
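
The anchoring trick above can be wrapped in a small helper that also drops the empty placeholder bin it creates. This is a minimal sketch, not a tested library routine: resample_from_anchor is a hypothetical name, the anchor is assumed to be a month end that precedes the first observation, and an offset object is passed to .resample() instead of the '6M' string because recent pandas versions renamed that alias.

```python
import numpy as np
import pandas as pd

def resample_from_anchor(df, anchor, months=6):
    """Sum df into `months`-month bins whose edges line up with `anchor`.

    Hypothetical helper: `anchor` must be a month end earlier than the
    first observation in df.
    """
    anchor = pd.Timestamp(anchor)
    if anchor != anchor + pd.offsets.MonthEnd(0) or anchor >= df.index.min():
        raise ValueError("anchor must be a month end before the first observation")
    # Add one all-NaN row at the anchor so the first bin edge lands there.
    padded = df.reindex(df.index.union(pd.DatetimeIndex([anchor])))
    out = padded.resample(pd.offsets.MonthEnd(months)).sum()
    # Drop the placeholder bin that contains only the anchor row.
    return out.loc[out.index > anchor]

np.random.seed(42)
df = pd.DataFrame(
    {'logret': np.random.randn(1000)},
    index=pd.date_range('2001-05-25', periods=1000, freq='B'),
)
anchored = resample_from_anchor(df, '2001-04-30')
```

With the example data, anchored has eight rows labeled at the ends of October and April, starting at 2001-10-31.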

Improve my sixmonth() function to use timestamps

def sixmonth(d, m=6, n=4):
    o = (m - (d.month - n)) % m
    return d + pd.offsets.MonthEnd(o)

I first .resample('M') to make sure that I have end-of-month dates. I could modify sixmonth() to check for end-of-month dates, but I'm more afraid of finding some new edge case than a little inefficiency.

df.resample('M').sum().groupby(sixmonth).sum()
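
If the intermediate monthly resample is unwanted, one possible variation is to roll each date to its own month end before stepping forward, so the function accepts arbitrary dates. This is a sketch under assumptions: period_end is a hypothetical name, and time-of-day information is discarded with normalize() so that all dates in a period map to the same label.

```python
import numpy as np
import pandas as pd

def period_end(d, months=6, anchor_month=4):
    """Map any timestamp to the month end that closes its `months`-month period.

    Hypothetical helper: periods end in `anchor_month` and every `months`
    months after it.
    """
    # Whole months between d's month and the end of its period.
    ahead = (anchor_month - d.month) % months
    # MonthEnd(0) rolls forward to d's own month end (a no-op if already there).
    return d.normalize() + pd.offsets.MonthEnd(0) + pd.offsets.MonthEnd(ahead)

np.random.seed(42)
df = pd.DataFrame(
    {'logret': np.random.randn(1000)},
    index=pd.date_range('2001-05-25', periods=1000, freq='B'),
)
result = df.groupby(period_end).sum()
```

Because the grouping function is called once per index value, this is slower than a vectorized resample on large inputs, but it needs no frequency strings and no anchor row.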
Richard Herron