1

I am looking to do something similar to the question raised here but the timedelta in my case is 2 year periods, i.e. data is reported in 2008, 2010, 2012 etc. and I would just like to know the start timestamp and end timestamp for each site and count the number of entries in between.

Input data:

Report_Year PWSID
2008-01-01 A
2008-01-01 B
2008-01-01 C
2008-01-01 D
2010-01-01 A
2010-01-01 B
2010-01-01 C
2012-01-01 A
2012-01-01 B
2016-01-01 A

This would result in a table like so:

Report_Year PWSID Count Start date End Date
2008-01-01 A 4 2008 2016
2008-01-01 B 3 2008 2012
2008-01-01 C 2 2008 2010
2008-01-01 D 1 2008 2008

I think I need to use a function like the following:

from dateutil.relativedelta import relativedelta
def yearsago(years, from_date=None):
    if from_date is None:
        from_date = datetime.now()
    return from_date - relativedelta(years=years)

But can't work out how to embed that within a groupby function as you might do if working with hours like so:

g=df.groupby('PWSID')['Report_Year'].diff().ne(pd.Timedelta(hours=1)).groupby(df['PWSID']).cumsum()

Thanks very much in advance!

AlexaB
  • 115
  • 1
  • 1
  • 6

1 Answers1

1

It seems in your case the data aggregation is simpler - grouping by PWSID and collecting count, min year, and max year in three columns can be achieved like this:

from io import StringIO
import pandas as pd

#recreating your data frame
data1 = """Report_Year     PWSID
2008-01-01     A
2008-01-01     B
2008-01-01     C
2008-01-01     D
2010-01-01     A
2010-01-01     B
2010-01-01     C
2012-01-01     A
2012-01-01     B
2016-01-01     A"""

df = pd.read_csv(StringIO(data1), delim_whitespace=True, parse_dates=["Report_Year"])


g = df.groupby("PWSID")["Report_Year"]
sum_df = g.agg(Count= "count").reset_index()
sum_df["Start_date"] = g.min().dt.year.values
sum_df["End_date"] = g.max().dt.year.values
print(sum_df)

Output:

  PWSID  Count  Start_date  End_date
0     A      4        2008      2016
1     B      3        2008      2012
2     C      2        2008      2010
3     D      1        2008      2008

P.S.: It felt rather tedious to do this in individual steps, so I asked for a better solution. There is indeed one:

...
df = pd.read_csv(StringIO(data1), delim_whitespace=True, parse_dates=["Report_Year"])

sum_df = df.assign(Year=pd.to_datetime(df['Report_Year']).dt.year).groupby('PWSID').agg(
    N=('PWSID', 'count'), Start_date=('Year', 'first'), End_date=('Year', 'last')).reset_index()
print(sum_df)
Mr. T
  • 11,960
  • 10
  • 32
  • 54