I have a very large dataset with hundreds of thousands of customer status records formatted as follows. An account can have multiple separate customers, each with their own status that is independent from the account status.
#Account, Customer, date, previous_status, current_status
123, abc, 2022-12-30, active, suspended
456, xyz, 2023-01-01, suspended, closed
123, abc, 2023-01-02, suspended, active
123, def, 2023-01-02, active, closed
789, qrs, 2023-01-03, active, suspended
789, qrs, 2023-01-03, suspended, active
246, jkl, 2023-01-04, active, suspended
357, ghi, 2023-01-04, active, suspended
123, abc, 2023-01-05, suspended, active
246, jkl, 2023-01-05, suspended, active
I need to calculate the number of days the customer was suspended when their status changes to either closed or active.
[https://stackoverflow.com/questions/63037503/calculate-the-time-difference-between-two-rows-with-conditions] got me pretty close, but I'm struggling to adapt it to only take the difference when the account and customer are the same.
https://stackoverflow.com/questions/66128346/calculate-difference-between-dates-for-sequential-pandas-rows-based-on-condition does account for multiple different accounts, but when I try I get a key error on this line
y = temp_df.iloc[-1:, -1:]["or_date"].values[0]
Dataframe
df = pd.DataFrame({'#Account': [123, 456, 123, 123, 789, 789, 246, 357, 123, 246],
'Customer': ["abc", "xyz", "abc", "def", "qrs", "qrs", "jkl", "ghi", "abc", "jkl"],
'date': ["2022-12-30", "2023-01-01", "2023-01-02", "2023-01-02", "2023-01-03",
"2023-01-03", "2023-01-04", "2023-01-04", "2023-01-05", "2023-01-05"],
'previous_status': ["active", "suspended", "suspended", "active", "active",
"suspended", "active", "active", "suspended", "suspended"],
'current_status': ["suspended", "closed", "active", "closed", "suspended",
"active", "suspended", "suspended", "active", "active"],
})