Resample Pandas With Minimum Required Number of Observations

Question

I'm having trouble figuring out how to resample a pandas date-time indexed dataframe, but require a minimum number of values in order to give a value. I'd like to resample daily data to monthly, and require at least 90% of values to be present to yield a value.

With an input of daily data:

import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=365, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts['2011-01-01':'2011-01-05'] = np.nan  # add a short run of NaNs to the time series
ts['2011-10-03':'2011-10-30'] = np.nan  # add a nearly month-long run of NaNs

that has only a few NaNs in January but almost a full month of NaNs in October, I'd like the output of my monthly resampling sum:

ts.resample('M').sum()

to give NaN for October (more than 90% of daily data missing) and a value for January (less than 90% of data missing), instead of the current output:

2011-01-31    11.949479
2011-02-28    -1.730698
2011-03-31    -0.141164
2011-04-30    -0.291702
2011-05-31    -1.996223
2011-06-30    -1.936878
2011-07-31     5.025407
2011-08-31    -1.344950
2011-09-30    -2.035502
2011-10-31    -2.571338
2011-11-30   -13.492956
2011-12-31     7.100770

I've read this post, which uses a rolling mean with min_periods, but I'd prefer to keep using resample for its direct time indexing. Is this possible? I have not found much in the resample docs or on Stack Overflow that addresses this.

[This post](https://stackoverflow.com/questions/43556344/pandas-monthly-rolling-operation) illustrates how resample is able to use calendar months, but "rolling", with min_periods option, cannot. — EHB, Feb 27 '18 at 23:02

root · Answer 1 · 2018-02-27T23:47:31.820

Get both the sum and a count of non-null values when you use resample, then use the non-null count to alter the sum as appropriate:

import numpy as np

# resample to monthly, computing both the sum and the non-null count
ts = ts.resample('M').agg(['sum', 'count'])

# flag months where at most 10% of the days have a non-null value
invalid = ts['count'] <= 0.1 * ts.index.days_in_month

# keep only the sum column and null out the flagged months
ts = ts['sum']
ts[invalid] = np.nan

Alternatively, you can write a custom sum function that does this filtering internally, though it might not be as efficient on large datasets:

def sum_valid_obs(x):
    min_obs = 0.1 * x.index[0].days_in_month
    valid_obs = x.notnull().sum()
    if valid_obs < min_obs:
        return np.nan
    return x.sum()


ts = ts.resample('M').apply(sum_valid_obs)

The resulting output for either method:

2011-01-31     3.574859
2011-02-28     2.907705
2011-03-31   -10.060877
2011-04-30     3.270250
2011-05-31    -3.492617
2011-06-30    -1.855461
2011-07-31    -7.363193
2011-08-31     0.128842
2011-09-30    -9.509890
2011-10-31          NaN
2011-11-30     0.543561
2011-12-31     3.354250
Freq: M, Name: sum, dtype: float64

score 4 · Accepted Answer · answered Mar 23 '20 at 20:51

With a recent pandas version (from the docs I would say starting with v0.22.0) you can just use the min_count keyword argument:

import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=365, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts['2011-01-01':'2011-01-05'] = np.nan  # add a short run of NaNs to the time series
ts['2011-10-03':'2011-10-30'] = np.nan  # add a nearly month-long run of NaNs

# require at least 20 non-null observations per month, otherwise return NaN
ts.resample('M').sum(min_count=20)

Output

2011-01-31     8.000269
2011-02-28    -6.648587
2011-03-31    10.593682
2011-04-30    -1.214945
2011-05-31     4.259289
2011-06-30    -5.986097
2011-07-31    -6.612820
2011-08-31    -1.073952
2011-09-30    -2.164976
2011-10-31          NaN
2011-11-30     1.912070
2011-12-31    12.101526
Freq: M, dtype: float64
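
To make the effect of min_count easier to see, here is a minimal sketch on a small deterministic series (all ones, so each monthly sum equals the number of valid days). The dates and the names default_sum and strict_sum are made up for illustration. pd.offsets.MonthEnd() is passed instead of the 'M' string only so that the sketch does not depend on which month-end alias the installed pandas version accepts.

```python
import numpy as np
import pandas as pd

# 90 days of ones covering January through March 2011 (31 + 28 + 31 days)
rng = pd.date_range("2011-01-01", periods=90, freq="D")
ts = pd.Series(1.0, index=rng)
ts.loc["2011-02-01":"2011-02-28"] = np.nan  # February entirely missing
ts.loc["2011-03-01":"2011-03-20"] = np.nan  # March keeps 11 valid days

monthly = ts.resample(pd.offsets.MonthEnd())

# default min_count=0: NaNs are skipped, so an all-NaN month sums to 0.0
default_sum = monthly.sum()

# min_count=20: months with fewer than 20 valid values become NaN
strict_sum = monthly.sum(min_count=20)

print(default_sum.tolist())  # [31.0, 0.0, 11.0]
print(strict_sum.tolist())   # [31.0, nan, nan]
```

Note that min_count is a fixed number of observations, not a fraction of the month, so the same threshold applies to 28-day and 31-day months alike.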

Note that the min_count argument is only valid for certain pandas resampler methods (such as sum, min, max). Using min_count with other methods (such as mean, std) will raise an UnsupportedFunctionCall. — westr, Nov 19 '21 at 09:17
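
For aggregations such as mean that do not take min_count, the sum-and-count idea from the first answer can be adapted. The following is only a sketch under assumptions: resample_mean_min_fraction and min_fraction are hypothetical names, not part of pandas, and the threshold is expressed as a fraction of the calendar days in each month.

```python
import numpy as np
import pandas as pd


def resample_mean_min_fraction(series, min_fraction=0.1):
    """Monthly mean, set to NaN where fewer than `min_fraction` of the
    month's calendar days have a non-null value (hypothetical helper)."""
    stats = series.resample(pd.offsets.MonthEnd()).agg(["mean", "count"])
    # required number of valid days per month (28, 30 or 31 days long)
    required = min_fraction * stats.index.days_in_month.to_numpy()
    # keep the mean only where enough observations were present
    return stats["mean"].where(stats["count"] >= required)


# deterministic version of the question's data: all ones with two NaN runs
rng = pd.date_range("2011-01-01", periods=365, freq="D")
ts = pd.Series(1.0, index=rng)
ts.loc["2011-01-01":"2011-01-05"] = np.nan  # January keeps 26 of 31 days
ts.loc["2011-10-03":"2011-10-30"] = np.nan  # October keeps 3 of 31 days

at_least_10pct = resample_mean_min_fraction(ts, min_fraction=0.1)
at_least_90pct = resample_mean_min_fraction(ts, min_fraction=0.9)
```

With min_fraction=0.1 only October is nulled out (3 valid days is below 3.1), while min_fraction=0.9 also nulls January (26 valid days is below 27.9).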
