
I have dataframe of events with start and end dates like this:

import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1,29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]
                   })

Note: The "Event" column may not have unique values.

I have another dataframe which contains all the holidays:

df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13), ....]})

I want to know, for every event, how many holidays fall between the start and end dates, both inclusive. My solution:

df1['holiday_count'] = df1.apply(lambda x: len(set(pd.date_range(x['Start'], x['End'])).intersection(set(df2['Holidays']))), axis=1)

I realize that my solution is quite inefficient for a large `df1`. Here are a few things I tried:

  1. Since it is not an exact match, `df1.merge` won't help.
  2. I tried `pd.merge_asof`, but it joins at most one match per row, whereas a start-to-end period may contain several holidays, or none at all.
  3. I tried `pd.IntervalIndex`. The issue I faced there was a KeyError for ranges that contained no holidays.
  4. A cross merge followed by a filter is an option, but I think it would have a high memory footprint, which I want to avoid.
  5. I didn't try it, but people suggested `pandasql`; however, comments stated it is a slow method.

These attempts were based on several past Stack Overflow questions, such as:

  1. Best way to join / merge by range in pandas
  2. Fastest way to merge pandas dataframe on ranges
  3. Merge pandas dataframes where one value is between two others
  4. How to join two dataframes for which column values are within a certain range?
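For the record, the `pd.IntervalIndex` attempt (item 3 above) can be made to work without a KeyError by testing membership with `contains` instead of indexing by interval. A minimal sketch, assuming the three example holidays shown above; it still loops over holidays in Python, so it mainly illustrates the idea rather than being a high-performance fix:

```python
import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1,29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13)]})

# One closed interval per event; contains() returns a boolean array per
# holiday, so ranges with no holidays simply contribute zeros (no KeyError).
intervals = pd.IntervalIndex.from_arrays(df1["Start"], df1["End"], closed="both")
df1["holiday_count"] = np.sum([intervals.contains(h) for h in df2["Holidays"]], axis=0)
```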
CharcoalG
  • Have you tested your code before? It gives an `AttributeError` for *pd.Dataframe* and a `TypeError: Cannot convert input [0 2022-01-04..`. – Timeless Jun 12 '23 at 10:31
  • @Timeless for `df1`, I wrote `Dataframe` instead of `DataFrame` (with a capital F), hence the attribute error. The code defining `df1` now works for me. Apologies for the error in the markup. – CharcoalG Jun 12 '23 at 10:36
  • Of course, one needs `import pandas as pd` and `from datetime import datetime`. I will edit the code for clarity. – CharcoalG Jun 12 '23 at 10:37
  • `df1['holiday_count'] = ..` still gives a `SyntaxError` and a `TypeError`, but anyway, I posted an answer below if you're tempted to give it a try. – Timeless Jun 12 '23 at 10:41

3 Answers


You can try this approach:

import numpy as np

sdates, edates = df1["Start"].values, df1["End"].values
hdates = df2["Holidays"].values[:, None]

df1["holiday_count"] = np.sum((hdates >= sdates) & (hdates <= edates), axis=0)

# 801 µs ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Output :

print(df1)

  Event      Start        End  holiday_count
0    S1 2022-01-04 2022-01-19              2
1    K1 2022-01-15 2022-01-29              0
2    S2 2022-09-12 2022-09-27              0
3    S3 2022-11-11 2022-11-22              0
4    A1 2022-05-29 2022-06-15              0
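The broadcast above materialises an (n_holidays, n_events) boolean mask. If `df2` is also large, a two-sided `np.searchsorted` over the sorted holidays gives the same counts with no 2-D intermediate. A sketch, again assuming the three example holidays from the question:

```python
import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1,29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13)]})

hdays = np.sort(df2["Holidays"].values)
# Count holidays strictly before Start vs. holidays up to and including End;
# the difference is the inclusive per-event count.
lo = np.searchsorted(hdays, df1["Start"].values, side="left")
hi = np.searchsorted(hdays, df1["End"].values, side="right")
df1["holiday_count"] = hi - lo
```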
Timeless
  • Your solution is significantly faster than mine. I will mark it as the accepted solution in a day unless a better one comes up. – CharcoalG Jun 12 '23 at 10:55

I would also suggest this alternative solution to your question:

# Function to count the number of holidays between two dates (inclusive)
def count_holidays(start_date, end_date, holidays):
    return sum(start_date <= holiday <= end_date for holiday in holidays)

# Merge df1 and df2 on Event
merged_df = pd.merge(df1, df2, left_on="Event", right_on="Event", how="left")

# Count holidays for each event
merged_df["HolidaysCount"] = merged_df.apply(lambda row: count_holidays(row["Start"], row["End"], row["Holidays"]), axis=1)

# Drop the Holidays column
merged_df.drop("Holidays", axis=1, inplace=True)

print(merged_df)
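As the comments below note, the merge key above does not exist in `df2`. A working cross-merge variant (the high-memory option from item 4 of the question) might look roughly like this, assuming the three example holidays:

```python
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1,29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13)]})

# Cross merge pairs every event with every holiday (memory grows as n * m);
# a boolean filter plus groupby then yields the per-event counts.
tmp = df1.reset_index().merge(df2, how="cross")
tmp["hit"] = tmp["Holidays"].between(tmp["Start"], tmp["End"])  # inclusive on both ends
df1["holiday_count"] = tmp.groupby("index")["hit"].sum()
```

Grouping by the original index (rather than "Event") keeps the counts correct even when event names are not unique, as the question warns.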
Dejene T.
  • I do not think this is any better (if not worse) a solution than mine, as your `apply` function has loops with constant re-assignment of the `curr_date` variable. – CharcoalG Jun 12 '23 at 10:26
  • I just avoided the loop and modified my code above. – Dejene T. Jun 12 '23 at 10:28
  • I do not think the code for `merged_df` will work, because there is no `Event` column in `df2` to use as `right_on`. Regardless, if I understand correctly, you're trying to do a cross merge and then compute the holiday count that fits each date range. As I mentioned in my original post, this would have high memory usage. – CharcoalG Jun 12 '23 at 10:41

This is an inequality join, which is solved efficiently with `conditional_join`:

# pip install pyjanitor
# for better performance, if you can
# install the dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

(df1
.conditional_join(
    df2, 
    ('Start', 'Holidays', '<='), 
    ('End', 'Holidays', '>='), 
    how = 'left')
.groupby(df1.columns.tolist(), sort = False)
.count()
)
                             Holidays
Event Start      End                 
S1    2022-01-04 2022-01-19         2
K1    2022-01-15 2022-01-29         0
S2    2022-09-12 2022-09-27         0
S3    2022-11-11 2022-11-22         0
A1    2022-05-29 2022-06-15         0

Under the hood, it uses binary search instead of a cartesian join, which offers better performance and memory efficiency on large data.

sammywemmy