
I have dataframe of events with start and end dates like this:

import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1,29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]
                   })

Note: The "Event" column may not have unique values.

I have another dataframe which contains all the holidays:

df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13), ....]})

I want to know, for every event, how many holidays fall between the start and end dates, both inclusive. My solution:

df1['holiday_count'] = df1.apply(lambda x: len(set(pd.date_range(x['Start'], x['End'])).intersection(set(df2['Holidays']))), axis=1)

I realize that my solution is quite inefficient for a large `df1`. Here are a few things I tried:

  1. Since it is not an exact match, `df1.merge` won't help.
  2. I tried `pd.merge_asof`, but it joins at most one match per row, whereas a start-to-end period may contain several holidays, or none at all.
  3. I tried `pd.IntervalIndex`. The issue I faced there was a KeyError for ranges that contained no holidays.
  4. A cross merge followed by a filter is an option, but I think it would have a high memory footprint, which I want to avoid.
  5. I didn't try it, but people suggested `pandasql`; however, comments stated it is a slow method.

These attempts were based on several past Stack Overflow questions, such as:

  1. Best way to join / merge by range in pandas
  2. Fastest way to merge pandas dataframe on ranges
  3. Merge pandas dataframes where one value is between two others
  4. How to join two dataframes for which column values are within a certain range?
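For the record, the `pd.IntervalIndex` attempt (item 3 above) can be made to work without a KeyError by testing membership with `contains` instead of indexing by interval. A minimal sketch, assuming the three example holidays shown above; it still loops over holidays in Python, so it mainly illustrates the idea rather than being a high-performance fix:

```python
import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1,29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13)]})

# One closed interval per event; contains() returns a boolean array per
# holiday, so ranges with no holidays simply contribute zeros (no KeyError).
intervals = pd.IntervalIndex.from_arrays(df1["Start"], df1["End"], closed="both")
df1["holiday_count"] = np.sum([intervals.contains(h) for h in df2["Holidays"]], axis=0)
```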
CharcoalG
  • Have you tested your code before? It gives an `AttributeError` for *pd.Dataframe* and a `TypeError: Cannot convert input [0 2022-01-04..`. – Timeless Jun 12 '23 at 10:31
  • @Timeless for `df1`, I wrote `Dataframe` instead of `DataFrame` (with a capital F), hence the attribute error. The code defining `df1` now works for me. Apologies for the error in the markup. – CharcoalG Jun 12 '23 at 10:36
  • Of course, one needs `import pandas as pd` and `from datetime import datetime`. I will edit the code for clarity. – CharcoalG Jun 12 '23 at 10:37
  • `df1['holiday_count'] = ..` still gives a `SyntaxError` and a `TypeError`, but anyway, I posted an answer below if you're tempted to give it a try. – Timeless Jun 12 '23 at 10:41

3 Answers


You can try this approach:

import numpy as np

sdates, edates = df1["Start"].values, df1["End"].values
hdates = df2["Holidays"].values[:, None]

df1["holiday_count"] = np.sum((hdates >= sdates) & (hdates <= edates), axis=0)

# 801 µs ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Output :

print(df1)

  Event      Start        End  holiday_count
0    S1 2022-01-04 2022-01-19              2
1    K1 2022-01-15 2022-01-29              0
2    S2 2022-09-12 2022-09-27              0
3    S3 2022-11-11 2022-11-22              0
4    A1 2022-05-29 2022-06-15              0
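The broadcast above materialises an (n_holidays, n_events) boolean mask. If `df2` is also large, a two-sided `np.searchsorted` over the sorted holidays gives the same counts with no 2-D intermediate. A sketch, again assuming the three example holidays from the question:

```python
import numpy as np
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1,29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13)]})

hdays = np.sort(df2["Holidays"].values)
# Count holidays strictly before Start vs. holidays up to and including End;
# the difference is the inclusive per-event count.
lo = np.searchsorted(hdays, df1["Start"].values, side="left")
hi = np.searchsorted(hdays, df1["End"].values, side="right")
df1["holiday_count"] = hi - lo
```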
Timeless
  • Your solution is significantly faster than mine. I will mark it as the accepted solution in a day unless a better one comes up. – CharcoalG Jun 12 '23 at 10:55

I would also suggest this alternative solution to your question:

# Function to count the number of holidays between two dates (inclusive)
def count_holidays(start_date, end_date, holidays):
    return sum(start_date <= holiday <= end_date for holiday in holidays)

# Merge df1 and df2 on Event
merged_df = pd.merge(df1, df2, left_on="Event", right_on="Event", how="left")

# Count holidays for each event
merged_df["HolidaysCount"] = merged_df.apply(lambda row: count_holidays(row["Start"], row["End"], row["Holidays"]), axis=1)

# Drop the Holidays column
merged_df.drop("Holidays", axis=1, inplace=True)

print(merged_df)
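As the comments below note, the merge key above does not exist in `df2`. A working cross-merge variant (the high-memory option from item 4 of the question) might look roughly like this, assuming the three example holidays:

```python
import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({"Event": ["S1", "K1", "S2", "S3", "A1"],
                    "Start": [datetime(2022,1,4), datetime(2022,1,15), datetime(2022,9,12), datetime(2022,11,11), datetime(2022,5,29)],
                    "End": [datetime(2022,1,19), datetime(2022,1,29), datetime(2022,9,27), datetime(2022,11,22), datetime(2022,6,15)]})
df2 = pd.DataFrame({"Holidays": [datetime(2022,1,1), datetime(2022,1,6), datetime(2022,1,13)]})

# Cross merge pairs every event with every holiday (memory grows as n * m);
# a boolean filter plus groupby then yields the per-event counts.
tmp = df1.reset_index().merge(df2, how="cross")
tmp["hit"] = tmp["Holidays"].between(tmp["Start"], tmp["End"])  # inclusive on both ends
df1["holiday_count"] = tmp.groupby("index")["hit"].sum()
```

Grouping by the original index (rather than "Event") keeps the counts correct even when event names are not unique, as the question warns.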
Dejene T.
  • I do not think this is any better (if not worse) a solution than mine, as your `apply` function has loops with constant re-assignment of the `curr_date` variable. – CharcoalG Jun 12 '23 at 10:26
  • I just avoided the loop and modified my code above. – Dejene T. Jun 12 '23 at 10:28
  • I do not think the code for `merged_df` will work, because there is no `Event` column in `df2` to use as `right_on`. Regardless, if I understand correctly, you're trying to do a cross merge and then compute the holiday count that fits each date range. As I mentioned in my original post, this would have high memory usage. – CharcoalG Jun 12 '23 at 10:41

This is an inequality join, which is solved efficiently with `conditional_join`:

# pip install pyjanitor
# for better performance, if you can
# install the dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

(df1
.conditional_join(
    df2, 
    ('Start', 'Holidays', '<='), 
    ('End', 'Holidays', '>='), 
    how = 'left')
.groupby(df1.columns.tolist(), sort = False)
.count()
)
                             Holidays
Event Start      End                 
S1    2022-01-04 2022-01-19         2
K1    2022-01-15 2022-01-29         0
S2    2022-09-12 2022-09-27         0
S3    2022-11-11 2022-11-22         0
A1    2022-05-29 2022-06-15         0

Under the hood, it uses binary search instead of a cartesian join, which offers better performance and memory efficiency on large data.

sammywemmy