0

I have a sessions table with three columns

sessions table

And I have a table of user's actions with datestamps

user's actions

To the user's action table, I would like to add a column that states in which session # that action took place, based on the dates

Result

I was thinking of making a for loop where I take each row of user_id, and then for that row and inner loop to check if the date of that row is inside row 1 of Sessions (session 1), if not, next iteration will be to check with row 2 of Sessions table, if not, the third row, and so on untill it matches and we can move back to the outer loop to take the next row of User's actions and repeat the process of iterating over sessions table.

But is there a faster computational way to do it? My tables are more than 10 million rows

  • Though this is an interesting problem, images make it very laborious to reproduce the data and test solutions. https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples has good examples of ways to provide better access to your sample data, which will encourage solutions – ALollz Aug 24 '21 at 16:44
  • Hi, can you check with the last row of the expected result ? `02/03/2021 15:21` is not within the range of `02/03/2021 14:43` and `02/03/2021 14:45` – SeaBean Aug 24 '21 at 17:09

1 Answers1

1

You can do it this way:

(Assuming df1 is the sessions table, df2 is the table of user's actions)

1. Convert datetime format

df1['start datetime'] = pd.to_datetime(df1['start datetime'], dayfirst=True)
df1['end datetime'] = pd.to_datetime(df1['end datetime'], dayfirst=True)

df2['datestamp'] = pd.to_datetime(df2['datestamp'], dayfirst=True)

2. Cross join the 2 dataframes:

df3 = df2.merge(df1, how='cross')

Or, if your Pandas version is older than 1.2.0 (December 2020 version), you can use:

df3 = df2.assign(key=1).merge(df1.assign(key=1), on='key').drop('key', axis=1)

3. Filter rows with datestamp within range of start datetime end datetime

df_out = df3.loc[df3['datestamp'].between(df3['start datetime'], df3['end datetime'])]

Result:

(modified the time of last row to make it within range)

print(df_out)

   user_id action           datestamp  session #      start datetime        end datetime
0        1      A 2021-01-15 08:21:00          1 2021-01-15 05:21:00 2021-01-15 20:22:00
4        1      A 2021-01-23 11:50:00          2 2021-01-23 11:21:00 2021-01-23 12:21:00
8        1      B 2021-03-02 14:44:00          3 2021-03-02 14:43:00 2021-03-02 14:45:00

Optionally, you can remove the unwanted columns, as follows:

df_out = df_out.drop(['start datetime', 'end datetime'], axis=1)

Result:

print(df_out)

   user_id action           datestamp  session #
0        1      A 2021-01-15 08:21:00          1
4        1      A 2021-01-23 11:50:00          2
8        1      B 2021-03-02 14:44:00          3
SeaBean
  • 22,547
  • 3
  • 13
  • 25