26

I have two dataframes that I need to merge based on whether or not a date value fits in between two other dates. Basically, I need to perform an outer join where B.event_date is between A.start_date and A.end_date. It seems that merge and join always assume a common column which in this case, I do not have.

    A                           B
    start_date  end_date        event_date  price
0   2017-03-27  2017-04-20  0   2017-01-20  100
1   2017-01-10  2017-02-01  1   2017-01-27  200

Result 
    start_date  end_date        event_date  price
0   2017-03-27  2017-04-20  
1   2017-01-10  2017-02-01      2017-01-20  100
2   2017-01-10  2017-02-01      2017-01-27  200
Pika Supports Ukraine
  • 3,612
  • 10
  • 26
  • 42
P Spence
  • 401
  • 1
  • 5
  • 10

2 Answers2

34

Create data and format to datetimes:

df_A = pd.DataFrame({'start_date':['2017-03-27','2017-01-10'],'end_date':['2017-04-20','2017-02-01']})
df_B = pd.DataFrame({'event_date':['2017-01-20','2017-01-27'],'price':[100,200]})

df_A['end_date'] = pd.to_datetime(df_A.end_date)
df_A['start_date'] = pd.to_datetime(df_A.start_date)
df_B['event_date'] = pd.to_datetime(df_B.event_date)

Create keys to do a cross join:

New in pandas 1.2.0+ how='cross' instead of assigning psuedo keys:

df_merge = df_A.merge(df_B, how='cross')

Else, with pandas < 1.2.0 use psuedo key to merge on 'key'

df_A = df_A.assign(key=1)
df_B = df_B.assign(key=1)
df_merge = pd.merge(df_A, df_B, on='key').drop('key',axis=1)

Filter out records that do not meet criteria of event dates between start and end dates:

df_merge = df_merge.query('event_date >= start_date and event_date <= end_date')

Join back to original date range table and drop key column

df_out = df_A.merge(df_merge, on=['start_date','end_date'], how='left').fillna('').drop('key', axis=1)

print(df_out)

Output:

              end_date           start_date           event_date price
0  2017-04-20 00:00:00  2017-03-27 00:00:00                           
1  2017-02-01 00:00:00  2017-01-10 00:00:00  2017-01-20 00:00:00   100
2  2017-02-01 00:00:00  2017-01-10 00:00:00  2017-01-27 00:00:00   200
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • 1
    This is a great solution -- just what I needed. I can't up-vote it since I don't seem to have enough "reputation" points. Works great though. – P Spence Apr 26 '17 at 21:22
  • 19
    No this method will cause an explosuon of records due to the cartesian product caused by this join. – Scott Boston Jan 09 '18 at 16:54
  • @ScottBoston do you propose a better alternative? there doesn't seem to be a proper way to express this operation in pandas. – pablete Jan 26 '18 at 12:56
  • @pablete See this [SO post](https://stackoverflow.com/a/44601120/6361531) as a possible solution. – Scott Boston Jan 26 '18 at 15:38
  • 1
    great solution ! Just wndering, if with time any better efficient solutions have come up ? even the `numpy` based solution you refer to is in a way bursting out. Any new ideas? – joel.wilson Jan 31 '18 at 19:58
  • 1
    @joel.wilson No, I haven't had any new ideas on doing this merge/join. – Scott Boston Jan 31 '18 at 20:08
  • This solution does not work on large datasets. It will blow it out of proportion. – Starbucks Dec 06 '18 at 19:15
  • @Starbucks Yes, it cause a cartesian explosion in records as noted in [this comment above](https://stackoverflow.com/questions/43593554/merging-two-dataframes-based-on-a-date-between-two-other-dates-without-a-common/43594038?noredirect=1#comment83325689_43594038). – Scott Boston Dec 06 '18 at 19:31
  • 2
    For future readers, I have added an answer for larger dataframe use cases [here](https://stackoverflow.com/a/54986850/3545273) – Serge Ballesta Mar 04 '19 at 15:55
1

conditional_join from pyjanitor may be helpful in the abstraction/convenience; the function is currently in dev:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

Reusing @scottboston's data :

df_B.conditional_join(
        df_A, 
        ('event_date', 'start_date', '>='), 
        ('event_date', 'end_date', '<='), 
        how = 'right'
    )

        left             right
  event_date  price start_date   end_date
0        NaT    NaN 2017-03-27 2017-04-20
1 2017-01-20  100.0 2017-01-10 2017-02-01
2 2017-01-27  200.0 2017-01-10 2017-02-01

Under the hood, it uses np.searchsorted (binary search) to avoid the cartesian join.

Note that if the intervals do not overlap, pd.IntervalIndex is a more efficient solution.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31