3

I am trying to group dates with a custom range using groupby and cut with no success so far. From the error message being returned, I wonder if cut is trying to process my dates as a number.

I want to group df1['date'] by custom date ranges and then sum the df1['HDD'] values. The custom ranges are found in df2:

import pandas as pd
df1 = pd.DataFrame( {'date': ['2/1/2015', '3/2/2015', '3/3/2015', '3/4/2015','4/17/2015','5/12/2015'],
                             'HDD' : ['7.5','8','5','23','11','55']})
    HDD  date
0   7.5 2/1/2015
1   8   3/2/2015
2   5   3/3/2015
3   23  3/4/2015
4   11  4/17/2015
5   55  5/12/2015

df2 has the custom date ranges:

df2 = pd.DataFrame( {'Period': ['One','Two','Three','Four'],
                     'Start Dates': ['1/1/2015','2/15/2015','3/14/2015','4/14/2015'],
                     'End Dates' : ['2/14/2015','3/13/2015','4/13/2015','5/10/2015']})

    Period  Start Dates End Dates
0   One     1/1/2015    2/14/2015
1   Two     2/15/2015   3/13/2015
2   Three   3/14/2015   4/13/2015
3   Four    4/14/2015   5/10/2015

My Desired output is to group df1 by the custom date ranges and aggregate the HDD values for each Period. Should output something like this:

   Period    HDD
0  One       7.5
1  Two       36
2  Three     0
3  Four      11

Here is one example of what I have tried to use custom grouping:

df3 = df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})

...and here is the error I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-103-55ea779bcd73> in <module>()
----> 1 df3 = df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})

/opt/conda/lib/python3.5/site-packages/pandas/tools/tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest)
    112     else:
    113         bins = np.asarray(bins)
--> 114         if (np.diff(bins) < 0).any():
    115             raise ValueError('bins must increase monotonically.')
    116 

/opt/conda/lib/python3.5/site-packages/numpy/lib/function_base.py in diff(a, n, axis)
   1576         return diff(a[slice1]-a[slice2], n-1, axis=axis)
   1577     else:
-> 1578         return a[slice1]-a[slice2]
   1579 
   1580 

TypeError: unsupported operand type(s) for -: 'str' and 'str'
  • Is cut trying to process my date ranges as numbers?
  • Do I need to explicitly convert my dates as datetime objects (tried this but maybe was going about it correctly)?

Thanks for any suggestions offered!

Adrien
  • 65
  • 2
  • 9
  • You should create 'Period' in df1 first and then calculate the groupby summary. – pe-perry Mar 02 '18 at 04:10
  • @kitman0804 - This is a minimal code example of what I'm trying to accomplish. In my actual problem, the data (represented as `df1` and `df2`in my question) are imported from external sources so I don't have the option to create them differently. Unless I'm misunderstanding what you're suggesting. – Adrien Mar 02 '18 at 04:18
  • I mean create the 'Period' variable in `df1` first, like people did [here](https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range), then do aggregation. – pe-perry Mar 02 '18 at 04:21
  • @kitman0804 ok, I see now what you mean. I'll take a look at that post. At first glance, it looks like it may work. I'll report back. Thanks! – Adrien Mar 02 '18 at 04:38
  • Possible duplicate: https://stackoverflow.com/questions/43500894/pandas-pd-cut-binning-datetime-column-series – spadarian Mar 02 '18 at 04:45
  • @kitman0804 - once I switched to Pandas v20.3, the solution that you referred to [here] (https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range) worked fine. I was initially using Pandas 19.2. – Adrien Mar 02 '18 at 12:43

1 Answers1

4

This works if you convert all your dates form dtype string to datetime.

df1['date'] = pd.to_datetime(df1['date'])

df2['End Dates'] = pd.to_datetime(df2['End Dates'])

df2['Start Dates'] = pd.to_datetime(df2['Start Dates'])

df1['HDD'] = df1['HDD'].astype(float)

df1.groupby(pd.cut(df1['date'], df2['Start Dates'])).agg({'HDD': sum})

Output:

                           HDD
date                          
(2015-01-01, 2015-02-15]   7.5
(2015-02-15, 2015-03-14]  36.0
(2015-03-14, 2015-04-14]   NaN

Adding labels:

df1.groupby(pd.cut(df1['date'], df2['Start Dates'], labels=df2.iloc[:-1,1])).agg({'HDD': sum})

Output:

        HDD
date       
One     7.5
Two    36.0
Three   NaN
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • I copied your code verbatim and tried to run it. This is the error that I received: `TypeError: Cannot cast ufunc less input from dtype(' – Adrien Mar 02 '18 at 05:47
  • I switched to another instance of iPython notebook with Pandas v 20.3 installed and this worked perfectly. Thanks! It looks like a lot of the solutions I've tried from stackoverflow were not working as posted because of the older version of Pandas (v19.2) which I've been using from the coursera course. – Adrien Mar 02 '18 at 12:41