After grouping a dataframe by a column, I want to select, for each group, the rows whose dates fall in the first six months of the year preceding that group's last date. For example, in this dataframe:
import pandas as pd

df = pd.DataFrame([
{'id': 'thing 1', 'date': '2016-01-01', 'quantity': 1 },
{'id': 'thing 1', 'date': '2016-02-01', 'quantity': 1 },
{'id': 'thing 1', 'date': '2016-09-01', 'quantity': 1 },
{'id': 'thing 1', 'date': '2016-10-01', 'quantity': 1 },
{'id': 'thing 2', 'date': '2017-01-01', 'quantity': 2 },
{'id': 'thing 2', 'date': '2017-02-01', 'quantity': 2 },
{'id': 'thing 2', 'date': '2017-02-11', 'quantity': 2 },
{'id': 'thing 2', 'date': '2017-09-01', 'quantity': 2 },
{'id': 'thing 2', 'date': '2017-10-01', 'quantity': 2 },
])
df.date = pd.to_datetime(df.date, format="%Y-%m-%d")
df
I would like to select the first 2 rows of thing 1 and the first 3 rows of thing 2.
I can find the start and end dates of that window for each id:
df.groupby('id').date.max()-pd.DateOffset(years=1)
df.groupby('id').date.max()-pd.DateOffset(months=6)
but I cannot find a way to select the dates in between.
My best (and naive) attempt, which is probably wrong in many ways, was the following:
df[(df.groupby('id').date>(df.groupby('id').date.max()-pd.DateOffset(years=1))) & (df.groupby('id').date<(df.groupby('id').date.max()-pd.DateOffset(months=6)))]
which understandably raises
TypeError: Invalid comparison between dtype=datetime64[ns] and SeriesGroupBy
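For reference, the comparison fails because a SeriesGroupBy object is not aligned row by row with the date column. One possible way around this, shown here only as a minimal sketch, is to broadcast each group's last date back onto its rows with transform('max') and then compare ordinary Series. The names last_date, window_start, window_end and selected are made up for illustration, and the window boundaries are assumed to be exclusive, as in the attempt above.

```python
import pandas as pd

# Same data as in the question, built column-wise for brevity.
df = pd.DataFrame({
    'id': ['thing 1'] * 4 + ['thing 2'] * 5,
    'date': pd.to_datetime([
        '2016-01-01', '2016-02-01', '2016-09-01', '2016-10-01',
        '2017-01-01', '2017-02-01', '2017-02-11', '2017-09-01', '2017-10-01',
    ]),
    'quantity': [1] * 4 + [2] * 5,
})

# transform('max') returns a Series with the same index as df: every row
# carries the last date of its own group, so row-wise comparison works.
last_date = df.groupby('id')['date'].transform('max')

# Per-row window: from one year before the last date to six months before it.
window_start = last_date - pd.DateOffset(years=1)
window_end = last_date - pd.DateOffset(months=6)

# Strict inequalities (assumption: boundaries excluded, as in the attempt).
selected = df[(df['date'] > window_start) & (df['date'] < window_end)]
print(selected)
```

With the sample data this should keep the first two rows of thing 1 and the first three rows of thing 2. If the boundaries should be included, Series.between could be used instead of the two comparisons.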