I have run into a property which I find peculiar about resampling Booleans in pandas. Here is some time series data:

import pandas as pd
import numpy as np

dr = pd.date_range('01-01-2020 5:00', periods=10, freq='H')
df = pd.DataFrame({'Bools':[True,True,False,False,False,True,True,np.nan,np.nan,False],
                   "Nums":range(10)},
                  index=dr)

So the data look like:

                     Bools  Nums
2020-01-01 05:00:00   True     0
2020-01-01 06:00:00   True     1
2020-01-01 07:00:00  False     2
2020-01-01 08:00:00  False     3
2020-01-01 09:00:00  False     4
2020-01-01 10:00:00   True     5
2020-01-01 11:00:00   True     6
2020-01-01 12:00:00    NaN     7
2020-01-01 13:00:00    NaN     8
2020-01-01 14:00:00  False     9

I would have thought I could do simple operations (like a sum) on the boolean column when resampling, but (as is) this fails:

>>> df.resample('5H').sum()

                    Nums
2020-01-01 05:00:00    10
2020-01-01 10:00:00    35

The "Bools" column is dropped. I suspected this happens because the column's dtype is object. Changing the dtype remedies the issue:

>>> copy = df.copy()  # work on a copy so df stays unchanged for the later examples
>>> copy['Bools'] = copy['Bools'].astype(float)
>>> copy.resample('5H').sum()

                     Bools  Nums
2020-01-01 05:00:00    2.0    10
2020-01-01 10:00:00    2.0    35
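As an alternative to casting to float, the nullable `boolean` extension dtype keeps the True/False/missing distinction while still being summable. This is a minimal sketch, assuming a pandas version that supports the `boolean` dtype in resample aggregations. The names `converted` and `summed` are introduced here for illustration, and the frequency is passed as an offset object instead of the '5H' string only to sidestep alias differences between pandas versions:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
dr = pd.date_range('2020-01-01 05:00', periods=10, freq=pd.offsets.Hour())
df = pd.DataFrame({'Bools': [True, True, False, False, False,
                             True, True, np.nan, np.nan, False],
                   'Nums': range(10)},
                  index=dr)

# Cast the object column to the nullable boolean dtype; NaN becomes pd.NA.
# assign() returns a new frame, so df itself is left untouched.
converted = df.assign(Bools=df['Bools'].astype('boolean'))

# Missing values are skipped by sum(), so each 5-hour bin counts its Trues.
summed = converted.resample(pd.offsets.Hour(5)).sum()
print(summed)
```

Unlike the float cast, this keeps the column recognisably Boolean, which may matter if it is used for masking later.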

But (oddly) you can still sum the Booleans by indexing the resample object without changing the dtype:

>>> r = df.resample('5H')
>>> r['Bools'].sum()

2020-01-01 05:00:00    2
2020-01-01 10:00:00    2
Freq: 5H, Name: Bools, dtype: int64

Likewise, if the Booleans are the only column, the resampled sum still works (even though the column's dtype is still object):

>>> df.drop(['Nums'],axis=1).resample('5H').sum()

                    Bools
2020-01-01 05:00:00      2
2020-01-01 10:00:00      2

What allows the latter two examples to work? I can see maybe they are a little more explicit ("Please, I really want to resample this column!"), but I don't see why the original resample doesn't allow the operation if it can be done.

Tom

2 Answers

Tracing the two calls through the pandas source shows that they resolve to different methods:

df.resample('5H')['Bools'].sum == Groupby.sum (in pd.core.groupby.generic.SeriesGroupBy)
df.resample('5H').sum == sum (in pandas.core.resample.DatetimeIndexResampler)

Following groupby_function in groupby.py shows that the first is equivalent to r.agg(lambda x: np.sum(x, axis=r.axis)). With r = df.resample('5H'), this outputs:

                     Bools  Nums  Nums2
2020-01-01 05:00:00      2    10     10
2020-01-01 10:00:00      2    35     35

(Strictly speaking, for the single-column case above it should have been r = df.resample('5H')['Bools'].)
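The per-column lambda route for the single "Bools" column can be sketched as follows. This is an illustration under assumptions, not the exact internal call: it uses the public `agg` method with `Series.sum` inside the lambda, the name `bools_sum` is made up for the example, and the frequency is given as an offset object only to avoid alias differences between pandas versions:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
dr = pd.date_range('2020-01-01 05:00', periods=10, freq=pd.offsets.Hour())
df = pd.DataFrame({'Bools': [True, True, False, False, False,
                             True, True, np.nan, np.nan, False],
                   'Nums': range(10)},
                  index=dr)

r = df.resample(pd.offsets.Hour(5))

# Select the object column explicitly and sum each 5-hour group on its own.
# Series.sum() skips NaN by default, so only the True/False values are added.
bools_sum = r['Bools'].agg(lambda s: s.sum())
print(bools_sum)
```

Because the lambda is applied to one group of one column at a time, the object dtype of "Bools" never gets a chance to exclude it from the result.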

Following the _downsample function in resample.py shows that the second is equivalent to df.groupby(r.grouper, axis=r.axis).agg(np.sum), which outputs:

                     Nums  Nums2
2020-01-01 05:00:00    10     10
2020-01-01 10:00:00    35     35
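The resample-as-groupby relationship can be checked with public API only. In this sketch `pd.Grouper(freq=...)` stands in for the internal `r.grouper` mentioned above (an assumption made for illustration), and only the numeric "Nums" column is used so the comparison does not depend on how object columns are treated:

```python
import pandas as pd

# The numeric column from the question, on the same hourly index.
dr = pd.date_range('2020-01-01 05:00', periods=10, freq=pd.offsets.Hour())
nums = pd.Series(range(10), index=dr, name='Nums')

step = pd.offsets.Hour(5)  # 5-hour bins, same as '5H' in the question

# Downsample with resample ...
via_resample = nums.resample(step).sum()

# ... and with an ordinary groupby on a time-based Grouper.
via_groupby = nums.groupby(pd.Grouper(freq=step)).sum()

print(via_resample)
print(via_groupby)
```

Both routes bin the rows into the same 5-hour groups, so any difference between them comes from how the aggregation treats each column, not from the grouping itself.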
Partha Mandal
  • Interesting! I guess these do boil down to slightly different things, thanks for tracking that down – Tom Jul 15 '20 at 14:39
  • Yes, I feared the distinction would be merely tautological (it still kinda feels like it is!) - but the `lambda` there forces addition for each of the columns individually, which the other version doesn't (`.agg(np.sum, axis=0)` does though) – Partha Mandal Jul 15 '20 at 15:05

df.resample('5H').sum() doesn't work on the Bools column because the column holds mixed data (Booleans and NaN), which pandas stores with dtype object. When sum() is called on a resample or groupby object, object-typed columns are ignored.
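One way to see in advance which columns count as numeric is to inspect the dtypes before aggregating. A small sketch using `select_dtypes`; the variable names `numeric_cols` and `object_cols` are illustrative only:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
dr = pd.date_range('2020-01-01 05:00', periods=10, freq=pd.offsets.Hour())
df = pd.DataFrame({'Bools': [True, True, False, False, False,
                             True, True, np.nan, np.nan, False],
                   'Nums': range(10)},
                  index=dr)

# Mixing True/False with NaN makes pandas fall back to the object dtype.
print(df.dtypes)

# Columns pandas treats as numeric versus columns stored as generic objects.
numeric_cols = df.select_dtypes(include='number').columns.tolist()
object_cols = df.select_dtypes(include='object').columns.tolist()
print(numeric_cols, object_cols)
```

Here "Bools" lands in the object group, which is consistent with it being left out of the original resampled sum.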

Quang Hoang
  • Yes, but I was more wondering why you can do `df.resample('5H')['Bools'].sum()` or `df.drop(['Nums'],axis=1).resample('5H').sum()`. The data are still type `object` in these cases, no? – Tom Jul 14 '20 at 21:30
  • There might be a flag somewhere that checks whether **all** columns are of `object` type, in which case `sum` is forced on all columns, which also works on `list` type. So `df.astype('object').resample('5H').sum()` would just work as well. – Quang Hoang Jul 14 '20 at 21:39
  • I didn't know that would work, and would explain why the `drop` version works. Thanks for sharing this – Tom Jul 15 '20 at 14:33