I have a dataframe of events with start and end dates like this:
import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
"Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
"End": [datetime(2022,1,19), datetime(2022,1, 29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]
})
Note: The "Event" column may not have unique values.
I have another dataframe which contains all the holidays:
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13), ....]})
For every event, I want to know how many holidays fall between the start and end date (both inclusive). My solution:
df1['holiday_count'] = df1.apply(lambda x: len(set(pd.date_range(x['Start'], x['End'])).intersection(set(df2['Holidays']))), axis=1)
I realize that my solution is quite inefficient when df1 is large. Here are a few things I tried:
- Since it is not an exact match, df1.merge wouldn't help.
- I tried pd.merge_asof, but it joins each row to at most one match, whereas here a start/end period may contain multiple holidays, or none at all.
- I tried pd.IntervalIndex. The issue I faced there was a KeyError for ranges that contained no holidays.
- A cross merge followed by a filter is one option, but I think it would have a high memory footprint, which I want to avoid.
- Although I didn't try it, people were suggesting pandas_sql. However, there were comments stating that it is a slow method.
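For context, one route not in the list above (an assumption on my part, not something from the trials) is to sort the holidays once and binary-search each event's bounds with np.searchsorted: counting holidays in [Start, End] then reduces to a difference of two insertion indices, with no sets, apply, or cross merge. A minimal sketch on the sample data:

```python
import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12),
                              datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1,29), datetime(2022,9,27),
                            datetime(2022,11,22), datetime(2022,6,15)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13)]})

# Sort the holidays once, then binary-search each interval's endpoints.
holidays = np.sort(df2["Holidays"].values)
# side="left" on Start and side="right" on End make both endpoints inclusive.
left = np.searchsorted(holidays, df1["Start"].values, side="left")
right = np.searchsorted(holidays, df1["End"].values, side="right")
df1["holiday_count"] = right - left
print(df1["holiday_count"].tolist())  # [2, 0, 0, 0, 0]
```

This does one O(m log m) sort of the m holidays plus an O(log m) lookup per event, instead of materializing a date range and a set intersection per row.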
These attempts were based on several past Stack Overflow questions, such as: