
I have a very large dataset with hundreds of thousands of customer status records formatted as follows. An account can have multiple separate customers, each with their own status that is independent from the account status.

#Account, Customer,       date, previous_status, current_status
     123,      abc, 2022-12-30,          active,      suspended
     456,      xyz, 2023-01-01,       suspended,         closed
     123,      abc, 2023-01-02,       suspended,         active
     123,      def, 2023-01-02,          active,         closed
     789,      qrs, 2023-01-03,          active,      suspended
     789,      qrs, 2023-01-03,       suspended,         active
     246,      jkl, 2023-01-04,          active,      suspended
     357,      ghi, 2023-01-04,          active,      suspended
     123,      abc, 2023-01-05,       suspended,         active
     246,      jkl, 2023-01-05,       suspended,         active

For each customer, I need to calculate the number of days they were suspended whenever their status changes from suspended to either closed or active.

https://stackoverflow.com/questions/63037503/calculate-the-time-difference-between-two-rows-with-conditions got me pretty close, but I'm struggling to adapt it to only take the difference when the account and customer are the same.

https://stackoverflow.com/questions/66128346/calculate-difference-between-dates-for-sequential-pandas-rows-based-on-condition does account for multiple different accounts, but when I try it I get a KeyError on this line:

y = temp_df.iloc[-1:, -1:]["or_date"].values[0]

Dataframe

import pandas as pd

df = pd.DataFrame({'#Account': [123, 456, 123, 123, 789, 789, 246, 357, 123, 246],
                   'Customer': ["abc", "xyz", "abc", "def", "qrs", "qrs", "jkl", "ghi", "abc", "jkl"],
                   'date':  ["2022-12-30", "2023-01-01", "2023-01-02", "2023-01-02", "2023-01-03", 
                             "2023-01-03", "2023-01-04", "2023-01-04", "2023-01-05", "2023-01-05"],
                   'previous_status':  ["active", "suspended", "suspended", "active", "active", 
                                        "suspended", "active", "active", "suspended", "suspended"],
                   'current_status':  ["suspended", "closed", "active", "closed", "suspended", 
                                       "active", "suspended", "suspended", "active", "active"],
                  })
Laurent B.

1 Answer


You should probably use vectorized operations for this; they will run much faster than iloc, map, apply, or any sort of loop. Look into numpy.where (or numpy.select if your conditions get long or complex). That way your function operates on entire columns at once rather than on individual rows, which is slow.

There's a good talk introducing basic examples of using it: https://youtu.be/nxWginnBklU
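As a rough illustration of what the column-wise style looks like, here is a minimal sketch on a cut-down frame that only has the two status columns from the question. The `ends_suspension` and `event` column names and the label strings are made up for this example; they are not part of your data.

```python
import numpy as np
import pandas as pd

# Cut-down sample: only the two status columns from the question.
df = pd.DataFrame({
    "previous_status": ["active", "suspended", "suspended", "active"],
    "current_status": ["suspended", "closed", "active", "closed"],
})

# Boolean mask computed over the whole column at once, no row loop.
is_end = (
    df["previous_status"].eq("suspended")
    & df["current_status"].isin(["active", "closed"])
)

# numpy.where: one condition, two possible outcomes.
df["ends_suspension"] = np.where(is_end, "yes", "no")

# numpy.select: several conditions checked in order, with a fallback value.
conditions = [df["current_status"].eq("suspended"), is_end]
choices = ["suspension_start", "suspension_end"]
df["event"] = np.select(conditions, choices, default="other")
```

numpy.select picks the first matching condition for each row, so the order of `conditions` matters once the conditions can overlap.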

In your case, the first thing that comes to mind is to use groupby to isolate each customer's records, check whether each customer's first date is an activation or a suspension, and then calculate the total suspended time by pairing up the dates. Assuming your system records every status change consistently, this should work fine.
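One possible way to sketch the groupby idea, using the sample frame from the question: instead of checking each customer's first row explicitly, this variant pairs every row with the previous row of the same account and customer via a grouped shift. That is a swapped-in simplification, not necessarily the only way to do the pairing. The `days_suspended` column name is an assumption, and rows with no matching suspension start in the data are left as NaN rather than guessed.

```python
import pandas as pd

df = pd.DataFrame({
    "#Account": [123, 456, 123, 123, 789, 789, 246, 357, 123, 246],
    "Customer": ["abc", "xyz", "abc", "def", "qrs", "qrs", "jkl", "ghi", "abc", "jkl"],
    "date": ["2022-12-30", "2023-01-01", "2023-01-02", "2023-01-02", "2023-01-03",
             "2023-01-03", "2023-01-04", "2023-01-04", "2023-01-05", "2023-01-05"],
    "previous_status": ["active", "suspended", "suspended", "active", "active",
                        "suspended", "active", "active", "suspended", "suspended"],
    "current_status": ["suspended", "closed", "active", "closed", "suspended",
                       "active", "suspended", "suspended", "active", "active"],
})
df["date"] = pd.to_datetime(df["date"])

# Stable sort so same-day rows keep their original relative order.
df = df.sort_values("date", kind="stable")

# Previous row's date and resulting status within each (account, customer) pair.
grouped = df.groupby(["#Account", "Customer"], sort=False)
prev_date = grouped["date"].shift()
prev_status = grouped["current_status"].shift()

# A row closes a suspension only if it leaves "suspended" AND the previous
# row for the same customer is the one that put them into "suspended".
ends_suspension = df["previous_status"].eq("suspended") & prev_status.eq("suspended")

# Day difference for closing rows, NaN everywhere else.
df["days_suspended"] = (df["date"] - prev_date).dt.days.where(ends_suspension)
```

On the sample data this gives 3 days for 123/abc on 2023-01-02, 0 days for 789/qrs, and 1 day for 246/jkl. The 456/xyz row and the second 123/abc reactivation stay NaN because the rows that started those suspensions are not in the sample.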

DMSBrian