Is there an efficient way of creating a dataframe column based on data in another dataframe?

Question

I have two dataframes. One contains data on user subscriptions, another on user session.

Example of subscription data (df_subscriptions):

        user_id                                 created                     ended
10238   140baa7a-1641-41b5-a85b-c43dc9e12699    2021-08-13 19:37:11.373039  2021-09-12 19:37:11.373039
10237   fbfa999c-9c56-4f06-8cf9-3c5deb32d5d2    2021-08-13 15:25:07.149982  2021-09-12 15:25:07.149982
6256    a55e64b0-a783-455e-bd9d-edbb4815786b    2021-08-13 18:31:36.083681  2021-09-12 18:31:36.083681
6257    ca2c0ee1-9810-4ce7-a2ec-c036d0b8a380    2021-08-13 16:29:52.981836  2021-09-12 16:29:52.981836
7211    24378efd-e821-4a51-a3e6-39c30243a078    2021-08-13 19:58:19.434908  2021-09-12 19:58:19.434908

Example of session data:

            user_id                                 session_start           session_duration
11960653    6f51df1a-8c2b-4ddb-9299-b36f250b05dc    2020-01-05 11:39:29.367 165.880005
80076       697e1c0a-c026-4104-b13f-1fd74eec5890    2021-01-31 02:16:33.935 22.883301
1577621     02b23671-8ce3-452b-b551-03b5ea7dce47    2021-05-18 02:07:32.589 4.283300
1286532     a506fb53-3505-44db-880a-27ad483151f0    2020-07-29 16:47:51.908 51.000000
18875432    1ea77db5-fe4a-414f-ba47-1f448175df3f    2020-10-17 04:00:35.269 360.733307

I need to calculate the total time user spend on the service while his subscription is active. The code below gives me correct/expected result, but takes A LOT of time on the real data:

def sessions_during_sub (user_id, start_date, end_date):
    result = df_sessions.loc[(df_sessions.user_id == user_id)&
                             (df_sessions.session_start >= start_date)&
                             (df_sessions.session_start <= end_date)].session_duration.sum()
    return result

df_subscriptions['sessions'] = df_subscriptions.apply(lambda x: sessions_during_sub(x['user_id'], x['created'], x['ended']), axis=1)

Is there any way to to do it proper pandas way/vectorized? Any ideas of how to speed it up really.

You are manually doing a merge operation on the `user_id` key. I'd suggest you read up on Pandas `merge` operations (https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) to work out a solution. Perhaps close this question and come back with a new question if you get stuck trying to work out the merge code! — w-m, Aug 27 '21 at 12:56
No, I'm not merging on user_id key. First of all, I need to 'merge' user sessions which fall between dates of active user subscription. Second, one user can have several subscriptions. I didn't find anything in the merge documentation which would allow for such complex logic to be executed during merge. If I'm mistaken a more precise pointer to this would be much appreciated. — Eldar Vagapov, Aug 28 '21 at 10:19
I wrote some example code in an answer, see whether you can work with that. — w-m, Aug 29 '21 at 12:56
Please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). The sample data you provided has not intersection on `user_id`. Make sure you present a minimal case that includes the "multiple sessions per user" issue you are facing. — Pierre D, Aug 29 '21 at 13:45

score 0 · Answer 1 · answered Aug 29 '21 at 12:56

Create some example data:

subs = pd.DataFrame(zip(["user_0", "user_0", "user_1", "user_2"], [1900, 1920, 1950, 2000], [1910, 1930, 2000, 2020]), columns=["user_id", "created", "ended"])

  user_id  created  ended
0  user_0     1900   1910
1  user_0     1920   1930
2  user_1     1950   2000
3  user_2     2000   2020

sessions = pd.DataFrame(zip(["user_0", "user_0", "user_0", "user_2"], [1905, 1915, 1925, 2005], [1.0, 5.0, 2.0, 7.0]), columns=["user_id", "session_start", "session_duration"])

  user_id  session_start  session_duration
0  user_0           1905               1.0
1  user_0           1915               5.0
2  user_0           1925               2.0
3  user_2           2005               7.0

The point of merging is to create a table with all the subscription and session data in the same row. This is similar to checking the user_id equality when looping through all the rows in both arrays in applying sessions_during_sub in the code in your questions:

merged = pd.merge(subs, sessions, on="user_id")


  user_id  created  ended  session_start  session_duration
0  user_0     1900   1910           1905               1.0
1  user_0     1900   1910           1915               5.0
2  user_0     1900   1910           1925               2.0
3  user_0     1920   1930           1905               1.0
4  user_0     1920   1930           1915               5.0
5  user_0     1920   1930           1925               2.0
6  user_2     2000   2020           2005               7.0

Having multiple subscriptions and multiple sessions per user is not an issue here, you just get multiple resulting rows with some duplicated data. You can then write some logic to check the subscription range, like so:

in_subscription_range = (merged.session_start >= merged.created) & (merged.session_start < merged.ended)

And finally compute the sum of the session duration, for example per user_id, like so: merged[in_subscription_range].groupby("user_id").session_duration.sum()

user_id
user_0    3.0
user_2    7.0
Name: session_duration, dtype: float64

If your original data contains temporally overlapping subscriptions or sessions, you need to fix that before merging, otherwise you may count the durations multiple times. But the same issue exists for your example code.

Is there an efficient way of creating a dataframe column based on data in another dataframe?

1 Answers1