Here's a variation based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.
# Ensure the df2 is sorted (skip if it's already known to be).
df2 = df2.sort_values(by=['start_date', 'end_date'])
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
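# Condition 1: df1's start lies inside the last df2 interval starting at or before it.
# Condition 2: df1's end lies inside the last df2 interval starting at or before it.
# Condition 3: the two lookups hit different df2 rows, i.e. a df2 interval starts within the df1 span.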
cond = [
    df1['date'].values <= s1['end_date'].values,
    df1['date_end'].values <= s2['end_date'].values,
    s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)
This may need to be modified if the intervals in df2 are nested or overlapping; I haven't fully thought it through in that scenario, but it may still work.
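If you do run into that case, a slow but unambiguous brute-force check is an easy way to verify the result on a small sample. The sketch below is just that, a verification helper I'm adding for illustration (the brute_force name and the 9m 59s span mirror the code above); it isn't part of the timed solutions.
import pandas as pd

def brute_force(df1, df2):
    # End of each 10-minute span from df1, mirroring 'date_end' above.
    date_end = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Closed intervals [s, e] and [start, end] overlap iff start <= e and s <= end.
    mask = [
        ((df2['start_date'] <= e) & (s <= df2['end_date'])).any()
        for s, e in zip(df1['date'], date_end)
    ]
    return df1[mask]
Since it checks every pair directly, it stays correct even when the df2 intervals are nested or overlap each other, which makes it a handy reference to compare the searchsorted output against.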
Using an Interval Tree
Not quite a pure Pandas solution, but you may want to consider building an Interval Tree from df2, and querying it against your intervals in df1 to find the ones that overlap. The intervaltree package on PyPI seems to have good performance and easy-to-use syntax.
from intervaltree import IntervalTree
# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10-minute spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
I converted the dates to their integer equivalents for performance reasons. I doubt the intervaltree package was built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that slow things down a bit.
Also, note that intervals in the intervaltree package do not include the end point, although the start point is included. That's why I have the + [0, 1] when creating tree; I'm padding the end point by a nanosecond to make sure the real end point is actually included. It's also the reason why it's fine for me to add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.
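To make that open end point concrete, here's a tiny standalone illustration (the numbers are arbitrary and have nothing to do with the data above):
from intervaltree import IntervalTree

# Intervals are half-open: the start point is included, the end point is not.
tree = IntervalTree.from_tuples([(0, 10)])
print(tree.overlaps(5, 15))    # True:  [5, 15) intersects [0, 10)
print(tree.overlaps(10, 20))   # False: the end point 10 is excluded from [0, 10)
That exclusive end is exactly why the tree is built with the extra nanosecond and queried with the full 10-minute span above.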
The resulting output for either method:
                  date        value
0  2016-11-24 00:00:00  1759.199951
1  2016-11-24 00:10:00   992.400024
6  2016-11-24 01:00:00    82.499999
7  2016-11-24 01:10:00    37.400003
8  2016-11-24 01:20:00   159.899994
Timings
Using the following setup to produce larger sample data:
# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
Which yields the following for df1 and df2:
df1
                     date     value
0     2016-11-24 00:00:00  0.444939
1     2016-11-24 00:10:00  0.407554
2     2016-11-24 00:20:00  0.460148
3     2016-11-24 00:30:00  0.465239
4     2016-11-24 00:40:00  0.462691
...
54995 2017-12-10 21:50:00  0.754123
54996 2017-12-10 22:00:00  0.401820
54997 2017-12-10 22:10:00  0.146284
54998 2017-12-10 22:20:00  0.394759
54999 2017-12-10 22:30:00  0.907233
df2
              start_date            end_date
0    2016-11-24 00:00:19 2016-11-24 00:41:24
1    2016-11-24 18:22:44 2016-11-24 18:36:44
2    2016-11-25 12:44:44 2016-11-25 13:03:13
3    2016-11-26 07:07:05 2016-11-26 07:49:29
4    2016-11-27 01:29:31 2016-11-27 01:34:32
...
495  2017-12-07 21:36:04 2017-12-07 22:14:29
496  2017-12-08 15:58:14 2017-12-08 16:10:35
497  2017-12-09 10:20:21 2017-12-09 10:26:40
498  2017-12-10 04:42:41 2017-12-10 05:22:47
499  2017-12-10 23:04:42 2017-12-10 23:44:53
And using the following functions for timing purposes:
def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
    ]
    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)
def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
    # Build the 10-minute spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')
    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values
    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])
    # DF FILTERING
    return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True)
               | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- the MODIFIED code starts here ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s1.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
I get the following timings:
%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops, best of 3: 9.55 ms per loop
%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops, best of 3: 13.5 ms per loop
%timeit ptrj(df1.copy(), df2.copy())
100 loops, best of 3: 18.5 ms per loop
%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop, best of 3: 4.02 s per loop
%timeit parfait(df1.copy(), df2.copy())
1 loop, best of 3: 8.96 s per loop