I am trying to do something similar to the question raised here, but in my case the time step is two years, i.e. data is reported in 2008, 2010, 2012, etc. For each site (PWSID), I would like to know the start and end timestamps and count the number of entries in between.
Input data:
Report_Year | PWSID |
---|---|
2008-01-01 | A |
2008-01-01 | B |
2008-01-01 | C |
2008-01-01 | D |
2010-01-01 | A |
2010-01-01 | B |
2010-01-01 | C |
2012-01-01 | A |
2012-01-01 | B |
2016-01-01 | A |
This would result in a table like so:
Report_Year | PWSID | Count | Start Date | End Date |
---|---|---|---|---|
2008-01-01 | A | 4 | 2008 | 2016 |
2008-01-01 | B | 3 | 2008 | 2012 |
2008-01-01 | C | 2 | 2008 | 2010 |
2008-01-01 | D | 1 | 2008 | 2008 |
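For reference, a minimal sketch of how the table above could be produced, assuming the data sits in a pandas DataFrame called `df` with `Report_Year` parsed as datetimes (the names `summary`, `Start` and `End` are just placeholders). It is a plain per-site aggregation, so it does not split a site at gaps such as the missing 2014 report for A:

```python
import pandas as pd

# Rebuild the input data shown above.
df = pd.DataFrame({
    "Report_Year": pd.to_datetime(
        ["2008-01-01"] * 4 + ["2010-01-01"] * 3 + ["2012-01-01"] * 2 + ["2016-01-01"]
    ),
    "PWSID": list("ABCD") + list("ABC") + list("AB") + ["A"],
})

# One row per site: number of reports, first report, last report.
summary = (
    df.groupby("PWSID")["Report_Year"]
      .agg(Count="count", Start="min", End="max")
      .reset_index()
)

# Keep only the year for the start and end columns, as in the desired table.
summary["Start"] = summary["Start"].dt.year
summary["End"] = summary["End"].dt.year

print(summary)
```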
I think I need to use a function like the following:
from datetime import datetime
from dateutil.relativedelta import relativedelta

def yearsago(years, from_date=None):
    # Return the date `years` years before from_date (defaults to now).
    if from_date is None:
        from_date = datetime.now()
    return from_date - relativedelta(years=years)
But I can't work out how to embed that within a groupby, as you might when working with hourly data, like so:
g=df.groupby('PWSID')['Report_Year'].diff().ne(pd.Timedelta(hours=1)).groupby(df['PWSID']).cumsum()
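A rough sketch of how the hourly pattern might carry over to two-year steps, assuming `Report_Year` is a datetime column. Rather than a `pd.Timedelta`, it compares the year numbers directly, because a two-year span is not a fixed number of days (2008 to 2010 is 731 days, 2010 to 2012 is 730). The names `years`, `new_run`, `run_id` and `runs` are illustrative only:

```python
import pandas as pd

# Rebuild the input data shown above.
df = pd.DataFrame({
    "Report_Year": pd.to_datetime(
        ["2008-01-01"] * 4 + ["2010-01-01"] * 3 + ["2012-01-01"] * 2 + ["2016-01-01"]
    ),
    "PWSID": list("ABCD") + list("ABC") + list("AB") + ["A"],
})

# diff() needs the rows ordered by time within each site.
df = df.sort_values(["PWSID", "Report_Year"])

# Gap in years to the previous report of the same site (NaN for the first one).
years = df["Report_Year"].dt.year
new_run = years.groupby(df["PWSID"]).diff().ne(2)

# Running count of "breaks" per site gives an id for each consecutive run.
run_id = new_run.groupby(df["PWSID"]).cumsum()

# Summarise each run of consecutive two-year reports.
runs = (
    df.groupby(["PWSID", run_id.rename("run")])["Report_Year"]
      .agg(Count="count", Start="min", End="max")
      .reset_index()
)

print(runs)
```

Note that with this approach site A splits into two runs (2008 to 2012, then 2016 on its own) because 2014 is missing, which differs from the table above where A is counted as a single span of 4 entries.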
Thanks very much in advance!