
I'm trying to compare two dataframes and drop the rows from the first dataframe whose dates aren't between the Start/End dates in the second dataframe (or, equivalently, select the rows that are between the dates in the 2nd dataframe). The selection should be inclusive on both ends. This might be really simple, but it's just not clicking for me right now.

Example data is below. Dataframe 1 can be generated using daily data starting July 1 2018 and ending November 30 2018, with random numbers in the 'Number' column. The ... in dataframe 1 indicates rows skipped for brevity; the data is present in the real dataframe.

Dataframe 1:

               Number
Date
2018-07-01     15.2
2018-07-02     17.3
2018-07-03     19.5
2018-07-04     13.7
2018-07-05     19.1
...
2018-09-15     30.4
2018-09-16     25.7
2018-09-17     21.2
2018-09-18     19.7
2018-09-19     23.4
...
2018-11-01     30.8
2018-11-02     47.2
2018-11-03     25.3
2018-11-04     39.7
2018-11-05     43.8

Dataframe 2:

              Change
Date
2018-07-02     Start
2018-07-04     End
2018-09-16     Start
2018-09-18     End
2018-11-02     Start
2018-11-04     End

With the example above, the output should be:

               Number
Date
2018-07-02     17.3
2018-07-03     19.5
2018-07-04     13.7
2018-09-16     25.7
2018-09-17     21.2
2018-09-18     19.7
2018-11-02     47.2
2018-11-03     25.3
2018-11-04     39.7
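For reference, the example frames described above can be reconstructed with something like the following (the names df1/df2 and the random seed are my own choices, not part of the question):

```python
import pandas as pd
import numpy as np

# Daily data from July 1 2018 through November 30 2018 with random numbers.
dates = pd.date_range('2018-07-01', '2018-11-30', freq='D')
rng = np.random.default_rng(0)
df1 = pd.DataFrame({'Number': rng.uniform(10, 50, len(dates)).round(1)},
                   index=pd.Index(dates, name='Date'))

# Alternating Start/End markers on the dates shown in dataframe 2.
df2 = pd.DataFrame(
    {'Change': ['Start', 'End'] * 3},
    index=pd.Index(pd.to_datetime(
        ['2018-07-02', '2018-07-04', '2018-09-16', '2018-09-18',
         '2018-11-02', '2018-11-04']), name='Date'))
```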

2 Answers


You can try this. It assumes the Start and End markers alternate (each Start is followed by its End) and that df2 is sorted by date.

df3 = pd.concat([df1.loc[i:j] for i, j in zip(df2.loc[df2['Change'] == 'Start'].index,
                                              df2.loc[df2['Change'] == 'End'].index)])
             Number
Date              
2018-07-02    17.3
2018-07-03    19.5
2018-07-04    13.7
2018-09-16    25.7
2018-09-17    21.2
2018-09-18    19.7
2018-11-02    47.2
2018-11-03    25.3
2018-11-04    39.7
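A self-contained sketch of the same slicing idea, using a small subset of the example data (note that `.loc` label slices on a sorted DatetimeIndex are inclusive of both endpoints, which is what makes this work):

```python
import pandas as pd

# Small subset of the question's data.
df1 = pd.DataFrame({'Number': [15.2, 17.3, 19.5, 13.7, 19.1]},
                   index=pd.DatetimeIndex(['2018-07-01', '2018-07-02',
                                           '2018-07-03', '2018-07-04',
                                           '2018-07-05'], name='Date'))
df2 = pd.DataFrame({'Change': ['Start', 'End']},
                   index=pd.DatetimeIndex(['2018-07-02', '2018-07-04'],
                                          name='Date'))

# Pair each Start date with the following End date and slice df1 between them.
starts = df2.index[df2['Change'] == 'Start']
ends = df2.index[df2['Change'] == 'End']
df3 = pd.concat([df1.loc[i:j] for i, j in zip(starts, ends)])
print(df3)
```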

You can build an IntervalIndex from df2's index and search in logarithmic time.

df2.index = pd.to_datetime(df2.index)
idx = pd.IntervalIndex.from_arrays(df2.index[df2.Change == 'Start'], 
                                   df2.index[df2.Change == 'End'],
                                   closed='both')

df1[idx.get_indexer(pd.to_datetime(df1.index)) > -1]

            Number
Date              
2018-07-02    17.3
2018-07-03    19.5
2018-07-04    13.7
2018-09-16    25.7
2018-09-17    21.2
2018-09-18    19.7
2018-11-02    47.2
2018-11-03    25.3
2018-11-04    39.7
  • I like this, but I'm not getting the answer I expect to get on a larger dataset. Perhaps I've done something wrong. Can you explain the line `df1[idx.get_indexer(pd.to_datetime(df1.index)) > -1]` a bit? – Eric D. Brown D.Sc. Dec 12 '18 at 18:19
  • @EricD.Brown `idx.get_indexer` accepts a list of dates, and will return the index of the interval for each date, if it is present in any of them. If a date is not in any of the intervals, -1 is returned. You may also want to try `idx.get_indexer_non_unique`. See [this post](https://stackoverflow.com/q/46364710/4909087) for usage. – cs95 Dec 12 '18 at 18:22
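The -1 convention described in the comment above can be seen in a small standalone sketch (intervals and query dates chosen purely for illustration):

```python
import pandas as pd

# Two closed intervals: [2018-07-02, 2018-07-04] and [2018-09-16, 2018-09-18].
idx = pd.IntervalIndex.from_arrays(
    pd.to_datetime(['2018-07-02', '2018-09-16']),
    pd.to_datetime(['2018-09-18', '2018-09-18'])[:1].append(
        pd.to_datetime(['2018-09-18'])),
    closed='both')

# Equivalent, written plainly:
idx = pd.IntervalIndex.from_arrays(
    pd.to_datetime(['2018-07-02', '2018-09-16']),
    pd.to_datetime(['2018-07-04', '2018-09-18']),
    closed='both')

dates = pd.to_datetime(['2018-07-03', '2018-08-01', '2018-09-18'])
pos = idx.get_indexer(dates)
# 2018-07-03 falls in the first interval (0), 2018-08-01 in none (-1),
# and 2018-09-18 in the second (1); the endpoint counts because closed='both'.
print(pos.tolist())  # [0, -1, 1]
```

Filtering with `pos > -1` then keeps exactly the dates that landed inside some interval.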