
I have user session ids and, for each one, the timestamps at which some event occurred. For every session I want to calculate the time between its first event and its last event. Please see the example below:

session_id   timestamp
sess1        2018-11-05 14:28:25.260
sess2        2018-11-04 12:14:59.576
sess2        2018-11-04 11:55:00.584
sess2        2018-11-04 12:16:44.702
sess3        2018-11-04 12:04:37.419

I want to calculate the difference between the first and last timestamps of sess2, and likewise for all other session_ids, like this:

session_id   timeSpent
sess1        1
sess2        125 (for example)        
sess3        1

How can I calculate this?

N91
  • Have you looked at [this question](https://stackoverflow.com/questions/22923775/calculate-pandas-dataframe-time-difference-between-two-columns-in-hours-and-minu) or [this question](https://stackoverflow.com/questions/8419564/difference-between-two-dates-in-python)? – Nathaniel Mar 30 '19 at 18:50
    this is different – N91 Mar 30 '19 at 19:18

2 Answers


Use:

#convert column to datetimes if necessary
df['timestamp'] = pd.to_datetime(df['timestamp'])

#aggregate min and max
df1 = df.groupby('session_id')['timestamp'].agg(['min','max'])
#subtract to new column
df1['timeSpent'] = df1.pop('max') - df1.pop('min')
df1 = df1.reset_index()
print (df1)
  session_id       timeSpent
0      sess1        00:00:00
1      sess2 00:21:44.118000
2      sess3        00:00:00
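
To try this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same min/max aggregation. The variable name `span` is only illustrative, and newer pandas versions may print the result with a leading `0 days`.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "session_id": ["sess1", "sess2", "sess2", "sess2", "sess3"],
    "timestamp": [
        "2018-11-05 14:28:25.260",
        "2018-11-04 12:14:59.576",
        "2018-11-04 11:55:00.584",
        "2018-11-04 12:16:44.702",
        "2018-11-04 12:04:37.419",
    ],
})
# Strings must become datetimes before they can be subtracted
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Earliest and latest timestamp per session, then their difference
span = df.groupby("session_id")["timestamp"].agg(["min", "max"])
span["timeSpent"] = span["max"] - span["min"]
print(span["timeSpent"])
```

Sessions with a single event get a zero duration, because their minimum and maximum are the same timestamp.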

One-line solution with GroupBy.agg and a (name, function) tuple:

df1 = (df.groupby('session_id')['timestamp']
        .agg([('timeSpent', lambda x: x.max() - x.min())])
        .reset_index())
print (df1)
  session_id       timeSpent
0      sess1        00:00:00
1      sess2 00:21:44.118000
2      sess3        00:00:00

If you need the output in seconds, convert the timedeltas with Series.dt.total_seconds:

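#aggregate min and max again, then subtract and convert to seconds
df1 = df.groupby('session_id')['timestamp'].agg(['min','max'])
df1['timeSpent'] = (df1.pop('max') - df1.pop('min')).dt.total_seconds()
df1 = df1.reset_index()
print (df1)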
  session_id  timeSpent
0      sess1      0.000
1      sess2   1304.118
2      sess3      0.000

One-line solution:

df1 = (df.groupby('session_id')['timestamp']
        .agg([('timeSpent', lambda x: x.max() - x.min())])
        .assign(timeSpent = lambda x: x['timeSpent'].dt.total_seconds())
        .reset_index())
print (df1)
  session_id  timeSpent
0      sess1      0.000
1      sess2   1304.118
2      sess3      0.000
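
As a possible variant, not part of the answer above: pandas 0.25 and later also support named aggregation, which can keep the first and last timestamps next to the duration. The sketch below assumes the sample data from the question; the column names `first_event` and `last_event` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "session_id": ["sess1", "sess2", "sess2", "sess2", "sess3"],
    "timestamp": pd.to_datetime([
        "2018-11-05 14:28:25.260",
        "2018-11-04 12:14:59.576",
        "2018-11-04 11:55:00.584",
        "2018-11-04 12:16:44.702",
        "2018-11-04 12:04:37.419",
    ]),
})

out = (
    df.groupby("session_id")
      # named aggregation: new_column=(source_column, function)
      .agg(first_event=("timestamp", "min"), last_event=("timestamp", "max"))
      # duration in seconds as a float
      .assign(timeSpent=lambda d: (d["last_event"] - d["first_event"]).dt.total_seconds())
      .reset_index()
)
print(out)
```

Keeping the two boundary columns makes it easy to check a duration by eye; drop them with `out.drop(columns=["first_event", "last_event"])` if only `timeSpent` is needed.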
jezrael

You can use groupby in combination with apply and subtract the minimum from the maximum (max - min):

df1 = df.groupby('session_id').timestamp.apply(lambda x: x.max() - x.min()).reset_index()

df1.rename({'timestamp':'timeSpent'},axis=1,inplace=True)

print(df1)
  session_id       timeSpent
0      sess1        00:00:00
1      sess2 00:21:44.118000
2      sess3        00:00:00

In seconds:

df1 = df.groupby('session_id').timestamp.apply(lambda x: x.max() - x.min()).reset_index()
df1.rename({'timestamp':'timeSpent'},axis=1,inplace=True)
df1['timeSpent'] = df1['timeSpent'].dt.total_seconds()

print(df1)
  session_id  timeSpent
0      sess1      0.000
1      sess2   1304.118
2      sess3      0.000
Erfan