0

I am working on a dataset which have parent_id, id, start_date and end_date as columns.

I want to check that in a calendar year, how many days a parent_id was not having an id running. The problem is with the “Overlap”. I want to exclude the overlapping days. How can I solve this problem?

Sample Input:

parent_id   id  start_date  end_date

ABCD    1   2019-10-26  2020-06-30

ABCD    2   2020-01-02  2020-04-01

ABCD    3   2020-07-09  2020-12-31

ABCD    4   2020-07-14  2020-09-26

EFGH    22  2020-01-02  2020-03-31

EFGH    23  2020-01-02  2020-04-01

EFGH    24  2020-01-15  2020-04-07

Sample Output:

parent_id   Year (2020)

ABCD    8

EFGH    268

So in year 2020 parent_id ABCD was inactive for 8 days. (July 1st to July 8th).

I did refer to similar questions on the portal but none of it worked for my problem.

Efficient date range overlap calculation in python?

How to find range overlap in python?

Find date range overlap in python

ThePyGuy
  • 17,779
  • 5
  • 18
  • 45
Dau_uaD
  • 88
  • 1
  • 9

1 Answers1

1

This might get you what you need:

def count_inactive_days(data):
    all_days = pd.date_range(pd.Timestamp(data["start_date"].min().year,1,1), pd.Timestamp.today(), freq="D")
    active_days = pd.DatetimeIndex(sorted(set(date for sublist in data.apply(lambda x: pd.date_range(x["start_date"], x["end_date"], freq="D"), axis=1).to_list() for date in sublist)))
    inactive_days = all_days.to_series().resample("Y").count() - active_days.to_series().resample("Y").count()
    inactive_days.index = inactive_days.index.year
    return inactive_days

>>> pd.concat([df.groupby("parent_id").apply(count_inactive_days)]).unstack()
            2019   2020
parent_id              
ABCD       298.0    8.0
EFGH         NaN  269.0
not_speshal
  • 22,093
  • 2
  • 15
  • 30
  • I did try to implement this. The only catch is for future days. For 2021 some instances show 344 days inactive which should be restricted to current date. – Dau_uaD Jul 08 '21 at 19:00
  • Gotcha, edited `all_days` to only take days until today into account. – not_speshal Jul 08 '21 at 19:02
  • The code works with a correct logic now. Added a line to avoid negative values when End_Date is in future. df[df < 0] = 0 Thanks for your help. – Dau_uaD Jul 08 '21 at 19:26