Merging dataframes based on date range

Question

I have two pandas dataframes: one (df1) with three columns (StartDate, EndDate, and ID) and a second (df2) with a Date. I want to merge df1 and df2 based on df2.Date between df1.StartDate and df2.EndDate.

Each date range in df1 is unique and doesn't overlap with any of the other rows in the dataframe.

Dates are formatted YYYY-MM-DD.

You'll have to define a func to perform the lookup, `merge` will not perform the match for you as it requires exact matches on value rather than between ranges — EdChum, Jul 09 '15 at 20:58
See related: http://stackoverflow.com/questions/21367485/merge-pandas-dataframes-based-on-irregular-time-intervals?rq=1 — EdChum, Jul 09 '15 at 21:00
Could you please accept my answer if that helps you solve the question? Many thanks. :-) — Jianxun Li, Jul 10 '15 at 13:00

Jianxun Li · Answer 1 · 2015-07-10T12:35:44.127

Just to provide an alternative way using np.piecewise. The performance is even faster than np.searchedsort.

import pandas as pd
import numpy as np

# data
# ====================================
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})

df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))

df1.StartDate

Out[139]: 
0   2010-01-01
1   2010-01-06
2   2010-01-11
3   2010-01-16
4   2010-01-21
5   2010-01-26
6   2010-01-31
7   2010-02-05
8   2010-02-10
Name: StartDate, dtype: datetime64[ns]

df2.date_time

Out[140]: 
0    2010-01-01
1    2010-01-02
2    2010-01-03
3    2010-01-04
4    2010-01-05
5    2010-01-06
6    2010-01-07
7    2010-01-08
8    2010-01-09
9    2010-01-10
        ...    
40   2010-02-10
41   2010-02-11
42   2010-02-12
43   2010-02-13
44   2010-02-14
45   2010-02-15
46   2010-02-16
47   2010-02-17
48   2010-02-18
49   2010-02-19
Name: date_time, dtype: datetime64[ns]


df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)


Out[143]: 
    date_time  values  ID_matched
0  2010-01-01 -0.2240           1
1  2010-01-02 -0.4202           1
2  2010-01-03  0.9998           1
3  2010-01-04  0.4310           1
4  2010-01-05 -0.6509           0
5  2010-01-06 -1.4987           2
6  2010-01-07 -1.2306           2
7  2010-01-08  0.1940           2
8  2010-01-09 -0.9984           2
9  2010-01-10 -0.3676           0
..        ...     ...         ...
40 2010-02-10  0.5242           9
41 2010-02-11  0.3451           9
42 2010-02-12  0.7244           9
43 2010-02-13 -2.0404           9
44 2010-02-14 -1.0798           0
45 2010-02-15 -0.6934           0
46 2010-02-16 -2.3380           0
47 2010-02-17  1.6623           0
48 2010-02-18 -0.2754           0
49 2010-02-19 -0.7466           0

[50 rows x 3 columns]

%timeit df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
1000 loops, best of 3: 466 µs per loop

Jianxun, thank you for the answer. Is there a way to do this with `datetime64` Series in pandas DataFrames? — Nicholas Tulach, Jul 10 '15 at 12:14
@NicholasTulach i think pandas will treat datetime as datetime64 as a default behaviour. so it should work on datetime64 series as well. if you see any error, could you please upload your sample datafile via a dropbox sharelink so that i can take a look at what might go wrong. — Jianxun Li, Jul 10 '15 at 12:20
@NicholasTulach if you run the above sample code, you can check the dtype of index, it's datetime64. — Jianxun Li, Jul 10 '15 at 12:22
When i change the code to use an arbitrary column instead of an index, it gives a ValueError. e.g. `df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2['date'] >= start_date)&(df2['date'] <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)` — Nicholas Tulach, Jul 10 '15 at 12:27
@NicholasTulach I think that's because `df2['date']` is a dataframe column, so it has the index with it. If you use `df2['date'].values`, then the `ValueError` should disappear. I've updated my sample code as well. Please give a look. — Jianxun Li, Jul 10 '15 at 12:37
This is a great solution. I can't accept it but I upvoted it. — Jesse Blocher, Jun 04 '18 at 14:32
Any specific reasons why you wouldn't just use df.index instead of creating the ID column in df1? — Joe Rivera, Feb 18 '21 at 21:51

Mark_Anderson · Answer 2 · 2019-10-07T20:31:50.863

Minor correction to @JianxunLi answer. Bit too involved for a comment.

This uses the len(funclist) == len(condlist) + 1 property of piecewise to assign a default value for when there is no match. Otherwise the default no-match value is zero, which can cause problems...

### Data / inits
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))

### Processing
valIfNoMatch = np.nan
df2['ID_matched'] = np.piecewise(np.zeros(len(df2)),\
                                     [(df2.date_time.values >= start_date)&(df2.date_time.values < end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)],\
                                     np.append(df1.ID.values, valIfNoMatch))

PS. Also corrected the typo testing both >= & <=; a timestamp on an exact boundary between intervals would return true for two different intervals, which breaks a key assumption of the method.

score 1 · Answer 3 · answered Oct 11 '21 at 06:03

conditional_join from pyjanitor may be helpful in the abstraction/convenience; the function is currently in dev:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

Reusing jianxun-li's data:

np.random.seed(123)

df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 
                    'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 
                    'ID': np.arange(1, 10, 1)})

df2 = pd.DataFrame(dict(values=np.random.randn(50),
                        date_time=pd.date_range('2010-01-01', periods=50, freq='D')))

df2.conditional_join(
      df1, 
      ('date_time', 'StartDate', '>='), 
      ('date_time', 'EndDate', '<=')
   )

        left                 right
      values  date_time  StartDate    EndDate ID
0  -1.085631 2010-01-01 2010-01-01 2010-01-04  1
1   0.997345 2010-01-02 2010-01-01 2010-01-04  1
2   0.282978 2010-01-03 2010-01-01 2010-01-04  1
3  -1.506295 2010-01-04 2010-01-01 2010-01-04  1
4   1.651437 2010-01-06 2010-01-06 2010-01-09  2
5  -2.426679 2010-01-07 2010-01-06 2010-01-09  2
6  -0.428913 2010-01-08 2010-01-06 2010-01-09  2
7   1.265936 2010-01-09 2010-01-06 2010-01-09  2
8  -0.678886 2010-01-11 2010-01-11 2010-01-14  3
9  -0.094709 2010-01-12 2010-01-11 2010-01-14  3
10  1.491390 2010-01-13 2010-01-11 2010-01-14  3
11 -0.638902 2010-01-14 2010-01-11 2010-01-14  3
12 -0.434351 2010-01-16 2010-01-16 2010-01-19  4
13  2.205930 2010-01-17 2010-01-16 2010-01-19  4
14  2.186786 2010-01-18 2010-01-16 2010-01-19  4
15  1.004054 2010-01-19 2010-01-16 2010-01-19  4
16  0.737369 2010-01-21 2010-01-21 2010-01-24  5
17  1.490732 2010-01-22 2010-01-21 2010-01-24  5
18 -0.935834 2010-01-23 2010-01-21 2010-01-24  5
19  1.175829 2010-01-24 2010-01-21 2010-01-24  5
20 -0.637752 2010-01-26 2010-01-26 2010-01-29  6
21  0.907105 2010-01-27 2010-01-26 2010-01-29  6
22 -1.428681 2010-01-28 2010-01-26 2010-01-29  6
23 -0.140069 2010-01-29 2010-01-26 2010-01-29  6
24 -0.255619 2010-01-31 2010-01-31 2010-02-03  7
25 -2.798589 2010-02-01 2010-01-31 2010-02-03  7
26 -1.771533 2010-02-02 2010-01-31 2010-02-03  7
27 -0.699877 2010-02-03 2010-01-31 2010-02-03  7
28 -0.173636 2010-02-05 2010-02-05 2010-02-08  8
29  0.002846 2010-02-06 2010-02-05 2010-02-08  8
30  0.688223 2010-02-07 2010-02-05 2010-02-08  8
31 -0.879536 2010-02-08 2010-02-05 2010-02-08  8
32 -0.805367 2010-02-10 2010-02-10 2010-02-13  9
33 -1.727669 2010-02-11 2010-02-10 2010-02-13  9
34 -0.390900 2010-02-12 2010-02-10 2010-02-13  9
35  0.573806 2010-02-13 2010-02-10 2010-02-13  9

Under the hood it uses np.searchsorted (binary search).

Note that pd.IntervalIndex is a more efficient option, especially when the intervals are not overlapping.

Merging dataframes based on date range

3 Answers3

Linked

Related