I have two dataframes. One contains data on user subscriptions, another on user session.
Example of subscription data (df_subscriptions):
user_id created ended
10238 140baa7a-1641-41b5-a85b-c43dc9e12699 2021-08-13 19:37:11.373039 2021-09-12 19:37:11.373039
10237 fbfa999c-9c56-4f06-8cf9-3c5deb32d5d2 2021-08-13 15:25:07.149982 2021-09-12 15:25:07.149982
6256 a55e64b0-a783-455e-bd9d-edbb4815786b 2021-08-13 18:31:36.083681 2021-09-12 18:31:36.083681
6257 ca2c0ee1-9810-4ce7-a2ec-c036d0b8a380 2021-08-13 16:29:52.981836 2021-09-12 16:29:52.981836
7211 24378efd-e821-4a51-a3e6-39c30243a078 2021-08-13 19:58:19.434908 2021-09-12 19:58:19.434908
Example of session data:
user_id session_start session_duration
11960653 6f51df1a-8c2b-4ddb-9299-b36f250b05dc 2020-01-05 11:39:29.367 165.880005
80076 697e1c0a-c026-4104-b13f-1fd74eec5890 2021-01-31 02:16:33.935 22.883301
1577621 02b23671-8ce3-452b-b551-03b5ea7dce47 2021-05-18 02:07:32.589 4.283300
1286532 a506fb53-3505-44db-880a-27ad483151f0 2020-07-29 16:47:51.908 51.000000
18875432 1ea77db5-fe4a-414f-ba47-1f448175df3f 2020-10-17 04:00:35.269 360.733307
I need to calculate the total time user spend on the service while his subscription is active. The code below gives me correct/expected result, but takes A LOT of time on the real data:
def sessions_during_sub (user_id, start_date, end_date):
result = df_sessions.loc[(df_sessions.user_id == user_id)&
(df_sessions.session_start >= start_date)&
(df_sessions.session_start <= end_date)].session_duration.sum()
return result
df_subscriptions['sessions'] = df_subscriptions.apply(lambda x: sessions_during_sub(x['user_id'], x['created'], x['ended']), axis=1)
Is there any way to to do it proper pandas way/vectorized? Any ideas of how to speed it up really.