2

After grouping a dataframe by a certain column, I want to select the dates in the first six months of the year preceding the last value of each key of the grouped column. For example, in this dataframe

df = pd.DataFrame([
  {'id': 'thing 1', 'date': '2016-01-01', 'quantity': 1 },
  {'id': 'thing 1', 'date': '2016-02-01', 'quantity': 1 },
  {'id': 'thing 1', 'date': '2016-09-01', 'quantity': 1 },
  {'id': 'thing 1', 'date': '2016-10-01', 'quantity': 1 },
  {'id': 'thing 2', 'date': '2017-01-01', 'quantity': 2 },
  {'id': 'thing 2', 'date': '2017-02-01', 'quantity': 2 },
  {'id': 'thing 2', 'date': '2017-02-11', 'quantity': 2 },
  {'id': 'thing 2', 'date': '2017-09-01', 'quantity': 2 },
  {'id': 'thing 2', 'date': '2017-10-01', 'quantity': 2 },
])
df.date = pd.to_datetime(df.date, format="%Y-%m-%d")
df

I would like to select the first 2 lines of thing 1 and the the first 3 lines of thing 2.

I can find the begin and end date for each id:

df.groupby('id').date.max()-pd.DateOffset(years=1)
df.groupby('id').date.max()-pd.DateOffset(months=6)

but I cannot find a way to select the dates in between.


My best (and naive) attempt, which is probably wrong in many ways, was the following:

df[(df.groupby('id').date>(df.groupby('id').date.max()-pd.DateOffset(years=1))) & (df.groupby('id').date<(df.groupby('id').date.max()-pd.DateOffset(months=6)))]

which understandably returns

TypeError: Invalid comparison between dtype=datetime64[ns] and SeriesGroupBy

shamalaia
  • 2,282
  • 3
  • 23
  • 35

2 Answers2

2

You can use groupby.transform and get the highest date on each row. Then we substract 6 months with pd.DateOffset and finally compare it to each row with Series.lt which stands for less than, same as <:

max_date = df.groupby('id')['date'].transform('max') - pd.DateOffset(months=6)
df[df['date'].lt(max_date)]
# or df[df['date'] < max_date]

Output

        id       date  quantity
0  thing 1 2016-01-01         1
1  thing 1 2016-02-01         1
4  thing 2 2017-01-01         2
5  thing 2 2017-02-01         2
6  thing 2 2017-02-11         2
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • When I run it I get "Invalid comparison between dtype=float64 and Timestamp"... But it seems the right way! One caveat is that I also have to specify a min_date; I assume I can just do `.gt(min_date)`? – shamalaia Oct 28 '20 at 14:53
  • Make sure your columns are all `datetime` type. Yes first make a min_date series. If you need any help, let me know. – Erfan Oct 28 '20 at 14:55
  • I managed to do it but in two lines of code `df=df[df['date'].lt(max_date)]' and `df=df[df['date'].gt(min_date)]' where `min_date = df.groupby('id')['date'].transform('max') - pd.DateOffset(years=1)`. Is there a one-liner that works? I got a vast zoo of errors trying some variations of `df[(df['date'].lt(max_date)) and (df[df['date'].gt(min_date)])]` – shamalaia Oct 28 '20 at 15:04
  • Pandas uses `&` as logical operator, so do: `df[df['date'].lt(max_date) & df['date'].gt(min_date)]` – Erfan Oct 28 '20 at 15:06
  • ah! right... but now I get `The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().` – shamalaia Oct 28 '20 at 15:13
  • If you didnt use `lt` or `gt`, make sure to use brackets around your conditions, for example: `(df['date'] < max_date)`, maybe read [this](https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o) for your understanding – Erfan Oct 28 '20 at 15:14
1

This is maybe a daft way to do it but...You can add a new column into your df for month based on the existing dates.

df['MONTH'] = pd.DatetimeIndex(df['date']).month

It will make Jan 1, Feb 2 etc

Then filter off that column for the first six months where | means or

df = df.loc[df['MONTH'] == 1]|[df['MONTH'] == 2]|[df['MONTH'] == 3]|[df['MONTH'] == 4]|[df['MONTH'] == 5]|[df['MONTH'] == 6]|
Josh Fox
  • 91
  • 1
  • 7