How to group Pandas DataFrame dates into custom date range bins using groupby/cut

Question

I am trying to group dates with a custom range using groupby and cut with no success so far. From the error message being returned, I wonder if cut is trying to process my dates as a number.

I want to group df1['date'] by custom date ranges and then sum the df1['HDD'] values. The custom ranges are found in df2:

import pandas as pd
df1 = pd.DataFrame( {'date': ['2/1/2015', '3/2/2015', '3/3/2015', '3/4/2015','4/17/2015','5/12/2015'],
                             'HDD' : ['7.5','8','5','23','11','55']})
    HDD  date
0   7.5 2/1/2015
1   8   3/2/2015
2   5   3/3/2015
3   23  3/4/2015
4   11  4/17/2015
5   55  5/12/2015

df2 has the custom date ranges:

df2 = pd.DataFrame( {'Period': ['One','Two','Three','Four'],
                     'Start Dates': ['1/1/2015','2/15/2015','3/14/2015','4/14/2015'],
                     'End Dates' : ['2/14/2015','3/13/2015','4/13/2015','5/10/2015']})

    Period  Start Dates End Dates
0   One     1/1/2015    2/14/2015
1   Two     2/15/2015   3/13/2015
2   Three   3/14/2015   4/13/2015
3   Four    4/14/2015   5/10/2015

My Desired output is to group df1 by the custom date ranges and aggregate the HDD values for each Period. Should output something like this:

   Period    HDD
0  One       7.5
1  Two       36
2  Three     0
3  Four      11

Here is one example of what I have tried to use custom grouping:

df3 = df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})

...and here is the error I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-103-55ea779bcd73> in <module>()
----> 1 df3 = df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})

/opt/conda/lib/python3.5/site-packages/pandas/tools/tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest)
    112     else:
    113         bins = np.asarray(bins)
--> 114         if (np.diff(bins) < 0).any():
    115             raise ValueError('bins must increase monotonically.')
    116 

/opt/conda/lib/python3.5/site-packages/numpy/lib/function_base.py in diff(a, n, axis)
   1576         return diff(a[slice1]-a[slice2], n-1, axis=axis)
   1577     else:
-> 1578         return a[slice1]-a[slice2]
   1579 
   1580 

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Is cut trying to process my date ranges as numbers?
Do I need to explicitly convert my dates as datetime objects (tried this but maybe was going about it correctly)?

Thanks for any suggestions offered!

You should create 'Period' in df1 first and then calculate the groupby summary. — pe-perry, Mar 02 '18 at 04:10
@kitman0804 - This is a minimal code example of what I'm trying to accomplish. In my actual problem, the data (represented as `df1` and `df2`in my question) are imported from external sources so I don't have the option to create them differently. Unless I'm misunderstanding what you're suggesting. — Adrien, Mar 02 '18 at 04:18
I mean create the 'Period' variable in `df1` first, like people did [here](https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range), then do aggregation. — pe-perry, Mar 02 '18 at 04:21
@kitman0804 ok, I see now what you mean. I'll take a look at that post. At first glance, it looks like it may work. I'll report back. Thanks! — Adrien, Mar 02 '18 at 04:38
Possible duplicate: https://stackoverflow.com/questions/43500894/pandas-pd-cut-binning-datetime-column-series — spadarian, Mar 02 '18 at 04:45
@kitman0804 - once I switched to Pandas v20.3, the solution that you referred to [here] (https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range) worked fine. I was initially using Pandas 19.2. — Adrien, Mar 02 '18 at 12:43

score 4 · Accepted Answer · answered Mar 02 '18 at 04:54

4

This works if you convert all your dates form dtype string to datetime.

df1['date'] = pd.to_datetime(df1['date'])

df2['End Dates'] = pd.to_datetime(df2['End Dates'])

df2['Start Dates'] = pd.to_datetime(df2['Start Dates'])

df1['HDD'] = df1['HDD'].astype(float)

df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})

Output:

                           HDD
date                          
(2015-01-01, 2015-02-15]   7.5
(2015-02-15, 2015-03-14]  36.0
(2015-03-14, 2015-04-14]   NaN

Adding labels:

df1.groupby(pd.cut(df1['date'], df2['Start Dates'], labels=df2.iloc[:-1,1])).agg({'HDD': sum})

Output:

        HDD
date       
One     7.5
Two    36.0
Three   NaN

answered Mar 02 '18 at 04:54

Scott Boston

147,308
15
139
187

I copied your code verbatim and tried to run it. This is the error that I received: `TypeError: Cannot cast ufunc less input from dtype(' – Adrien Mar 02 '18 at 05:47
I switched to another instance of iPython notebook with Pandas v 20.3 installed and this worked perfectly. Thanks! It looks like a lot of the solutions I've tried from stackoverflow were not working as posted because of the older version of Pandas (v19.2) which I've been using from the coursera course. – Adrien Mar 02 '18 at 12:41

How to group Pandas DataFrame dates into custom date range bins using groupby/cut

1 Answers1