Merging two dataframes based on a date between two other dates without a common column

Question

I have two dataframes that I need to merge based on whether or not a date value fits in between two other dates. Basically, I need to perform an outer join where B.event_date is between A.start_date and A.end_date. It seems that merge and join always assume a common column which in this case, I do not have.

    A                           B
    start_date  end_date        event_date  price
0   2017-03-27  2017-04-20  0   2017-01-20  100
1   2017-01-10  2017-02-01  1   2017-01-27  200

Result 
    start_date  end_date        event_date  price
0   2017-03-27  2017-04-20  
1   2017-01-10  2017-02-01      2017-01-20  100
2   2017-01-10  2017-02-01      2017-01-27  200

Possible duplicate of https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range — June, Nov 08 '18 at 19:05
Solutions for larger databases can be found at https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others/42796283. — shahar_m, Feb 24 '19 at 16:30

Scott Boston · Accepted Answer · 2021-02-28T15:25:06.577

34

Create data and format to datetimes:

df_A = pd.DataFrame({'start_date':['2017-03-27','2017-01-10'],'end_date':['2017-04-20','2017-02-01']})
df_B = pd.DataFrame({'event_date':['2017-01-20','2017-01-27'],'price':[100,200]})

df_A['end_date'] = pd.to_datetime(df_A.end_date)
df_A['start_date'] = pd.to_datetime(df_A.start_date)
df_B['event_date'] = pd.to_datetime(df_B.event_date)

Create keys to do a cross join:

New in pandas 1.2.0+ how='cross' instead of assigning psuedo keys:

df_merge = df_A.merge(df_B, how='cross')

Else, with pandas < 1.2.0 use psuedo key to merge on 'key'

df_A = df_A.assign(key=1)
df_B = df_B.assign(key=1)
df_merge = pd.merge(df_A, df_B, on='key').drop('key',axis=1)

Filter out records that do not meet criteria of event dates between start and end dates:

df_merge = df_merge.query('event_date >= start_date and event_date <= end_date')

Join back to original date range table and drop key column

df_out = df_A.merge(df_merge, on=['start_date','end_date'], how='left').fillna('').drop('key', axis=1)

print(df_out)

Output:

              end_date           start_date           event_date price
0  2017-04-20 00:00:00  2017-03-27 00:00:00                           
1  2017-02-01 00:00:00  2017-01-10 00:00:00  2017-01-20 00:00:00   100
2  2017-02-01 00:00:00  2017-01-10 00:00:00  2017-01-27 00:00:00   200

edited Feb 28 '21 at 15:25

answered Apr 24 '17 at 17:09

Scott Boston

147,308
15
139
187

1

This is a great solution -- just what I needed. I can't up-vote it since I don't seem to have enough "reputation" points. Works great though. – P Spence Apr 26 '17 at 21:22
19

No this method will cause an explosuon of records due to the cartesian product caused by this join. – Scott Boston Jan 09 '18 at 16:54
@ScottBoston do you propose a better alternative? there doesn't seem to be a proper way to express this operation in pandas. – pablete Jan 26 '18 at 12:56
@pablete See this [SO post](https://stackoverflow.com/a/44601120/6361531) as a possible solution. – Scott Boston Jan 26 '18 at 15:38
1

great solution ! Just wndering, if with time any better efficient solutions have come up ? even the `numpy` based solution you refer to is in a way bursting out. Any new ideas? – joel.wilson Jan 31 '18 at 19:58
1

@joel.wilson No, I haven't had any new ideas on doing this merge/join. – Scott Boston Jan 31 '18 at 20:08
This solution does not work on large datasets. It will blow it out of proportion. – Starbucks Dec 06 '18 at 19:15
@Starbucks Yes, it cause a cartesian explosion in records as noted in [this comment above](https://stackoverflow.com/questions/43593554/merging-two-dataframes-based-on-a-date-between-two-other-dates-without-a-common/43594038?noredirect=1#comment83325689_43594038). – Scott Boston Dec 06 '18 at 19:31
2

For future readers, I have added an answer for larger dataframe use cases [here](https://stackoverflow.com/a/54986850/3545273) – Serge Ballesta Mar 04 '19 at 15:55

score 1 · Answer 2 · answered Oct 11 '21 at 06:12

conditional_join from pyjanitor may be helpful in the abstraction/convenience; the function is currently in dev:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

Reusing @scottboston's data :

df_B.conditional_join(
        df_A, 
        ('event_date', 'start_date', '>='), 
        ('event_date', 'end_date', '<='), 
        how = 'right'
    )

        left             right
  event_date  price start_date   end_date
0        NaT    NaN 2017-03-27 2017-04-20
1 2017-01-20  100.0 2017-01-10 2017-02-01
2 2017-01-27  200.0 2017-01-10 2017-02-01

Under the hood, it uses np.searchsorted (binary search) to avoid the cartesian join.

Note that if the intervals do not overlap, pd.IntervalIndex is a more efficient solution.

Merging two dataframes based on a date between two other dates without a common column

2 Answers2

Linked