
I have a dataframe that contains users, actions, and the times at which the users took those actions. I want to group actions into a list if they satisfy BOTH of these conditions: 1. the actions were taken by the same user; 2. the actions were taken within 20 minutes of each other.

At the moment I'm trying to use timedelta to calculate the time difference by iterating over the rows. I read this post, but it's not what I'm looking for, and I'm struggling to find similar examples.

The dataframe has thousands of rows; this is part of it:

user    action      time
A       browse      2018-07-01 06:00:00
A       edit        2018-07-01 06:10:00
B       signin      2018-07-01 06:00:00
B       browse      2018-07-01 06:11:00
B       edit        2018-07-01 07:00:00

The expected output is a list of the actions that satisfy the conditions:

output
[[browse, edit], [signin, browse]]

The last 'edit' by user B is not included because (07:00:00) - (06:11:00) > 20 min.
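
As a quick check of that arithmetic, here is a minimal sketch using pandas timestamps. The variable names `gap` and `limit` are only illustrative.

```python
import pandas as pd

# Gap between user B's 'browse' (06:11) and the last 'edit' (07:00)
gap = pd.Timestamp("2018-07-01 07:00:00") - pd.Timestamp("2018-07-01 06:11:00")

# The 20-minute threshold from the question
limit = pd.Timedelta(minutes=20)

# Print the gap and whether it exceeds the threshold
print(gap, gap > limit)
```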

Any suggestions on how I can do this? Thank you very much in advance!

Osca

1 Answer


IIUC, you can use:

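import pandas as pd

# Parse the time column as datetimes
df['time'] = pd.to_datetime(df['time'])

# Gap to the same user's previous action; the first row per user is NaT, so back-fill it
cond = df.groupby('user')['time'].diff().bfill().lt(pd.Timedelta('20m'))

# Keep the matching rows and collect each user's actions into a list
df1 = df[cond].groupby('user')['action'].apply(list).tolist()

print(df1)

[['browse', 'edit'], ['signin', 'browse']]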
Abhi
  • Thank you very much! Just a quick question: is there any way I can spot-check whether actions are being grouped correctly? I tried `for i in df1: for j in i: if j == 'edit': print(i)` but it prints out the whole list. Thank you! – Osca Nov 02 '18 at 04:47
  • @Osca It's not clear what you want to check here. If you want to confirm then try it on a small subset of your data and verify. – Abhi Nov 02 '18 at 05:01
  • Thanks but I want to return i (the list itself) if 'edit' was found. Like this: ['browse', 'edit'] – Osca Nov 02 '18 at 05:04
  • @Osca You can use `for i in df1: if 'edit' in i: print(i)` This will return `['browse', 'edit']` – Abhi Nov 02 '18 at 05:12
  • @Abhi, would you be able to explain `.diff().bfill()` in the condition part? Literally, I know it means `difference` and `backfill`, but I would like to know how it works here. – Karn Kumar Nov 02 '18 at 07:47
  • @pygo Sure. When you take `diff()` with `groupby`, the first row of every group will be `NaT`, so `bfill` ensures that the first row is also selected. – Abhi Nov 02 '18 at 07:55
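
To make the `.diff().bfill()` explanation above concrete, here is a minimal sketch that rebuilds the sample rows from the question and shows each intermediate step, followed by the `'edit' in i` check from the comments. The names `gap`, `filled`, `steps`, `groups` and `with_edit` are illustrative and not part of the original answer. One thing to be aware of: `bfill()` here runs over the whole Series rather than per user, so it is worth verifying on a small subset of your own data, as suggested above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "user": ["A", "A", "B", "B", "B"],
    "action": ["browse", "edit", "signin", "browse", "edit"],
    "time": pd.to_datetime([
        "2018-07-01 06:00:00", "2018-07-01 06:10:00",
        "2018-07-01 06:00:00", "2018-07-01 06:11:00",
        "2018-07-01 07:00:00",
    ]),
})

# Step 1: gap to the same user's previous action (NaT on each user's first row)
gap = df.groupby("user")["time"].diff()

# Step 2: back-fill so each NaT takes the gap of the row below it
# (this operates on the whole Series, not per user)
filled = gap.bfill()

# Step 3: keep rows whose filled gap is under 20 minutes
cond = filled.lt(pd.Timedelta(minutes=20))

# Show the intermediate columns side by side
steps = df.assign(gap=gap, filled=filled, keep=cond)
print(steps)

# Collect the kept actions into one list per user
groups = df[cond].groupby("user")["action"].apply(list).tolist()

# Lists that contain 'edit', as discussed in the comments
with_edit = [g for g in groups if "edit" in g]
print(with_edit)
```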