0

I am trying to modify the overlapping time period problem so that if there is 1 day difference between dates, it should still be counted as an overlap. As long as the difference in dates is less than 2 days it should be seen as an overlap.

This is the dataframe containing the dates

df_dates = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103, 104, 104, 104, 102, 104, 104, 103, 106, 106, 106], 
                         "start dates": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3), pd.Timestamp(2002,10,20), pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9), pd.Timestamp(2005, 2, 8), pd.Timestamp(1993, 1, 1), pd.Timestamp(2005, 2, 3), pd.Timestamp(2005, 2, 16), pd.Timestamp(2002, 11, 16), pd.Timestamp(2005, 2, 23), pd.Timestamp(2005, 10, 11), pd.Timestamp(2015, 2, 9), pd.Timestamp(2011, 11, 24), pd.Timestamp(2011, 11, 24), pd.Timestamp(2011, 12, 21)],
                         "end dates": [pd.Timestamp(2002, 1, 3), pd.Timestamp(2002, 12, 3),pd.Timestamp(2002,11,20), pd.Timestamp(2003, 4, 4), pd.Timestamp(2004, 11, 1), pd.Timestamp(2015, 2, 8), pd.Timestamp(2005, 2, 3), pd.Timestamp(2005, 2, 15) , pd.Timestamp(2005, 2, 21),  pd.Timestamp(2003, 2, 16), pd.Timestamp(2005, 10, 8), pd.Timestamp(2005, 10, 21), pd.Timestamp(2015, 2, 17), pd.Timestamp(2011, 12, 31), pd.Timestamp(2011, 11, 25), pd.Timestamp(2011, 12, 22)]
                        })

This was helpful with answering the overlap question but I am not sure how to modify it (red circle) to include 1 day difference

enter image description here

This was my attempt at answering the question, which kind of did (red circle), but then the overlap calculation is not always right (yellow circle)

def Dates_Restructure(df, pers_id, start_dates, end_dates):
    df.sort_values([pers_id, start_dates], inplace=True)
    df['overlap'] = (df.groupby(pers_id)
                     .apply(lambda x: (x[end_dates].shift() - x[start_dates]) < timedelta(days=-1))
                     .reset_index(level=0, drop=True))
    df['cumsum'] = df.groupby(pers_id)['overlap'].cumsum()
    return df.groupby([pers_id, 'cumsum']).aggregate({start_dates: min, end_dates: max}).reset_index()  

enter image description here

I will appreciate your help with this. Thanks

Iruoma
  • 27
  • 8

1 Answers1

0

This was the answer I came up with and it worked. I combined the 2 solutions in my question to get this solution.

def Dates_Restructure(df_dates, pers_id, start_dates, end_dates):
    df2 = df_dates.copy()
    startdf2 = pd.DataFrame({pers_id: df2[pers_id], 'time': df2[start_dates], 'start_end': 1})
    enddf2 = pd.DataFrame({pers_id: df2[pers_id], 'time': df2[end_dates], 'start_end': -1})
    mergedf2 = pd.concat([startdf2, enddf2]).sort_values([pers_id, 'time'])
    mergedf2['cumsum'] = mergedf2.groupby(pers_id)['start_end'].cumsum()
    mergedf2['new_start'] = mergedf2['cumsum'].eq(1) & mergedf2['start_end'].eq(1)
    mergedf2['group'] = mergedf2.groupby(pers_id)['new_start'].cumsum()
    df2['group_id'] = mergedf2['group'].loc[mergedf2['start_end'].eq(1)]
    df3 = df2.groupby([pers_id, 'group_id']).aggregate({start_dates: min, end_dates: max}).reset_index()  
    df3.sort_values([pers_id, start_dates], inplace=True)
    df3['overlap'] = (df3.groupby(pers_id).apply(lambda x: (x[end_dates].shift() - x[start_dates]) < timedelta(days=-1))
                      .reset_index(level=0, drop=True))
    df3['GROUP_ID'] = df3.groupby(pers_id)['overlap'].cumsum()
    return df3.groupby([pers_id, 'GROUP_ID']).aggregate({start_dates: min, end_dates: max}).reset_index()       
Iruoma
  • 27
  • 8