
I'm trying to compare two dataframes and drop the rows from the first dataframe whose dates aren't between the Start/End dates in the second dataframe (or, equivalently, select the rows that are between the dates in the 2nd dataframe). The selection should be inclusive on both ends. This might be really simple, but it's just not clicking for me right now.

Example data is below. Dataframe 1 can be generated using daily data starting July 1 2018 and ending November 30 2018, with random numbers in the 'Number' column. The ... in dataframe 1 indicates rows skipped for brevity; the data is present in the real dataframe.

Dataframe 1:

               Number
Date
2018-07-01     15.2
2018-07-02     17.3
2018-07-03     19.5
2018-07-04     13.7
2018-07-05     19.1
...
2018-09-15     30.4
2018-09-16     25.7
2018-09-17     21.2
2018-09-18     19.7
2018-09-19     23.4
...
2018-11-01     30.8
2018-11-02     47.2
2018-11-03     25.3
2018-11-04     39.7
2018-11-05     43.8

Dataframe 2:

              Change
Date
2018-07-02     Start
2018-07-04     End
2018-09-16     Start
2018-09-18     End
2018-11-02     Start
2018-11-04     End

With the example above, the output should be:

               Number
Date
2018-07-02     17.3
2018-07-03     19.5
2018-07-04     13.7
2018-09-16     25.7
2018-09-17     21.2
2018-09-18     19.7
2018-11-02     47.2
2018-11-03     25.3
2018-11-04     39.7
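For reference, the example frames described above can be reconstructed with something like the following (the names df1/df2 and the random seed are my own choices, not part of the question):

```python
import pandas as pd
import numpy as np

# Daily data from July 1 2018 through November 30 2018 with random numbers.
dates = pd.date_range('2018-07-01', '2018-11-30', freq='D')
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'Number': rng.uniform(10, 50, len(dates)).round(1)},
                   index=pd.Index(dates, name='Date'))

# Alternating Start/End markers on the dates shown in dataframe 2.
df2 = pd.DataFrame(
    {'Change': ['Start', 'End'] * 3},
    index=pd.Index(pd.to_datetime(
        ['2018-07-02', '2018-07-04', '2018-09-16', '2018-09-18',
         '2018-11-02', '2018-11-04']), name='Date'))
```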

2 Answers


You can try this. It assumes the Start and End markers alternate (each Start is followed by its End) and that df2 is sorted by date.

df3 = pd.concat([df1.loc[i:j] for i, j in zip(df2.loc[df2['Change'] == 'Start'].index,
                                              df2.loc[df2['Change'] == 'End'].index)])
             Number
Date              
2018-07-02    17.3
2018-07-03    19.5
2018-07-04    13.7
2018-09-16    25.7
2018-09-17    21.2
2018-09-18    19.7
2018-11-02    47.2
2018-11-03    25.3
2018-11-04    39.7
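A self-contained sketch of the same slicing idea, using a small subset of the example data (note that `.loc` label slices on a sorted DatetimeIndex are inclusive of both endpoints, which is what makes this work):

```python
import pandas as pd

# Small subset of the question's data.
df1 = pd.DataFrame({'Number': [15.2, 17.3, 19.5, 13.7, 19.1]},
                   index=pd.DatetimeIndex(['2018-07-01', '2018-07-02',
                                           '2018-07-03', '2018-07-04',
                                           '2018-07-05'], name='Date'))
df2 = pd.DataFrame({'Change': ['Start', 'End']},
                   index=pd.DatetimeIndex(['2018-07-02', '2018-07-04'],
                                          name='Date'))

# Pair each Start date with the following End date and slice df1 between them.
starts = df2.index[df2['Change'] == 'Start']
ends = df2.index[df2['Change'] == 'End']
df3 = pd.concat([df1.loc[i:j] for i, j in zip(starts, ends)])
print(df3)
```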

You can build an IntervalIndex from df2's index and search in logarithmic time.

df2.index = pd.to_datetime(df2.index)
idx = pd.IntervalIndex.from_arrays(df2.index[df2.Change == 'Start'], 
                                   df2.index[df2.Change == 'End'],
                                   closed='both')

df1[idx.get_indexer(pd.to_datetime(df1.index)) > -1]

            Number
Date              
2018-07-02    17.3
2018-07-03    19.5
2018-07-04    13.7
2018-09-16    25.7
2018-09-17    21.2
2018-09-18    19.7
2018-11-02    47.2
2018-11-03    25.3
2018-11-04    39.7
  • I like this, but I'm not getting the answer I expect to get on a larger dataset. Perhaps I've done something wrong. Can you explain the line `df1[idx.get_indexer(pd.to_datetime(df1.index)) > -1]` a bit? – Eric D. Brown D.Sc. Dec 12 '18 at 18:19
  • @EricD.Brown `idx.get_indexer` accepts a list of dates, and will return the index of the interval for each date, if it is present in any of them. If a date is not in any of the intervals, -1 is returned. You may also want to try `idx.get_indexer_non_unique`. See [this post](https://stackoverflow.com/q/46364710/4909087) for usage. – cs95 Dec 12 '18 at 18:22
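The -1 convention described in the comment above can be seen in a small standalone sketch (intervals and query dates chosen purely for illustration):

```python
import pandas as pd

# Two closed intervals: [2018-07-02, 2018-07-04] and [2018-09-16, 2018-09-18].
idx = pd.IntervalIndex.from_arrays(
    pd.to_datetime(['2018-07-02', '2018-09-16']),
    pd.to_datetime(['2018-09-18', '2018-09-18'])[:1].append(
        pd.to_datetime(['2018-09-18'])),
    closed='both')

# Equivalent, written plainly:
idx = pd.IntervalIndex.from_arrays(
    pd.to_datetime(['2018-07-02', '2018-09-16']),
    pd.to_datetime(['2018-07-04', '2018-09-18']),
    closed='both')

dates = pd.to_datetime(['2018-07-03', '2018-08-01', '2018-09-18'])
pos = idx.get_indexer(dates)
# 2018-07-03 falls in the first interval (0), 2018-08-01 in none (-1),
# and 2018-09-18 in the second (1); the endpoint counts because closed='both'.
print(pos.tolist())  # [0, -1, 1]
```

Filtering with `pos > -1` then keeps exactly the dates that landed inside some interval.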