0

I have the following table and I need to re-calculate the StartDate based on some business rules that I was given. Specifically, when the StartDate of the succeeding session is within one calendar day (i.e., < 2 days) the sessions are combined, using the earlier StartDate and the later EndDate.

StartDate EndDate DaysBetweenSessions
2011-04-15 2011-04-21 3
2011-04-24 2011-04-27 3
2011-04-30 2011-05-01 0
2011-05-01 2011-05-06 5
2011-05-11 2011-05-15 1
2011-05-16 2011-05-17 0
2011-05-17 2011-05-25 10

What I need as a result:

StartDate EndDate DaysBetweenSessions NewStartDate NewEndDate
2011-04-15 2011-04-21 3 2011-04-15 2011-04-21
2011-04-24 2011-04-27 3 2011-04-24 2011-04-27
2011-04-30 2011-05-01 0 2011-04-30 2011-05-06
2011-05-01 2011-05-06 5 2011-04-30 2011-05-06
2011-05-11 2011-05-15 1 2011-05-11 2011-05-25
2011-05-16 2011-05-17 0 2011-05-11 2011-05-25
2011-05-17 2011-05-25 10 2011-05-11 2011-05-25

I have attempted to use df.shift to get the PreviousStartDate, PreviousEndDate, NextStartDate, NextEndDate, and then tried to loop through the dataset with df.iterrows(), calculating the new dates, but I run into a technical issue that I don't fully understand - while in the loop that was used to update a value, I cannot use an operator on that new value - the new value hasn't been written to the df yet.

adm-gis
  • 146
  • 10

1 Answers1

1

Try with groupby and tranform:

#convert to datetime if necessary
df["StartDate"] = pd.to_datetime(df["StartDate"])
df["EndDate"] = pd.to_datetime(df["EndDate"])

df["NewStartDate"] = df.groupby((df["DaysBetweenSessions"]>=2).cumsum().shift().fillna(0))["StartDate"].transform("min")
df["NewEndDate"] = df.groupby((df["DaysBetweenSessions"]>=2).cumsum().shift().fillna(0))["EndDate"].transform("max")

>>> df
  StartDate    EndDate  DaysBetweenSessions NewStartDate NewEndDate
0 2011-04-15 2011-04-21                    3   2011-04-15 2011-04-21
1 2011-04-24 2011-04-27                    3   2011-04-24 2011-04-27
2 2011-04-30 2011-05-01                    0   2011-04-30 2011-05-06
3 2011-05-01 2011-05-06                    5   2011-04-30 2011-05-06
4 2011-05-11 2011-05-15                    1   2011-05-11 2011-05-25
5 2011-05-16 2011-05-17                    0   2011-05-11 2011-05-25
6 2011-05-17 2011-05-25                   10   2011-05-11 2011-05-25
not_speshal
  • 22,093
  • 2
  • 15
  • 30
  • Thank you so much. This definitely seems to work. Can you explain the purpose of the ```shift()``` in this case? I'm a bit confused. – adm-gis Sep 07 '21 at 22:32
  • 1
    Just run `df["DaysBetweenSessions"]>=2).cumsum()` and see the output. It doesn't align with the groups you want. – not_speshal Sep 08 '21 at 12:21