I have two pandas dataframes as follows:

df1:

id  date        item
3   2015-11-23  B
3   2015-11-23  A
3   2016-05-11  C
3   2017-02-01  C
3   2018-07-12  E
4   2014-05-11  C
4   2015-02-01  C
4   2018-07-12  E

df2:

id  start       end
3   2016-05-11  2017-08-30
4   2015-01-11  2017-08-22

I would like to filter df1 so that I keep only the rows whose date falls within the date range given in df2 for the same id:

id  date        item
3   2016-05-11  C
3   2017-02-01  C
4   2015-02-01  C

In reality, df1 and df2 each have millions of rows, so I cannot rely on quick fixes such as for loops. I have a rough idea of using groupby on id, but all my attempts have failed so far.

Thank you in advance!

soarfy
  • Can you please update the OP with the code you have tried? – error404 Mar 07 '22 at 08:58
  • Does this answer your question? [Select DataFrame rows between two dates](https://stackoverflow.com/questions/29370057/select-dataframe-rows-between-two-dates) – mpx Mar 07 '22 at 09:02
  • @rpb It is very similar, but they have a single fixed start and end date, whereas here each id has its own start and end date. – soarfy Mar 07 '22 at 09:24

1 Answer

The basic approach is to merge df2 onto df1 by id, so that every row of df1 carries the start and end dates for its id. You can then filter on whether each row's date lies between those two dates.

df3 = df1.merge(df2, how='inner', on='id')

# keep rows whose date lies within [start, end], both ends inclusive
df3[(df3['date'] >= df3['start']) & (df3['date'] <= df3['end'])]
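For reference, here is a minimal self-contained sketch of the merge-then-filter idea on the sample data from the question. It assumes the date columns have been converted with `pd.to_datetime` and that df2 holds exactly one range per id (the `validate` argument raises an error if it does not). The names `merged`, `in_range` and `result` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "id": [3, 3, 3, 3, 3, 4, 4, 4],
    "date": pd.to_datetime([
        "2015-11-23", "2015-11-23", "2016-05-11", "2017-02-01",
        "2018-07-12", "2014-05-11", "2015-02-01", "2018-07-12",
    ]),
    "item": ["B", "A", "C", "C", "E", "C", "C", "E"],
})
df2 = pd.DataFrame({
    "id": [3, 4],
    "start": pd.to_datetime(["2016-05-11", "2015-01-11"]),
    "end": pd.to_datetime(["2017-08-30", "2017-08-22"]),
})

# Attach each id's start/end to its rows in df1.
# validate="many_to_one" fails loudly if df2 has duplicate ids
# (an assumption of this sketch, not something the question states).
merged = df1.merge(df2, on="id", how="inner", validate="many_to_one")

# Row-wise range check; Series.between is inclusive on both ends by default.
in_range = merged["date"].between(merged["start"], merged["end"])

# Keep only the original df1 columns for the rows inside their range.
result = merged.loc[in_range, ["id", "date", "item"]].reset_index(drop=True)
print(result)
```

If df2 really has one row per id, the merged frame has at most as many rows as df1, so the approach stays vectorised and should not blow up in size for millions of rows. With several ranges per id, the merge would duplicate df1 rows, and you would need to drop the `validate` argument and possibly de-duplicate the result.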
el_oso