2

I'm a beginner in Python Data Science. I'm working on clickstream data and want to find out the duration of a session. For that I find the start time and end time of the session. However on subtraction, I'm getting wrong answer for the same. Here is the data

       Sid                    Tstamp     Itemid  Category
0        1  2014-04-07T10:51:09.277Z  214536502         0
1        1  2014-04-07T10:54:09.868Z  214536500         0
2        1  2014-04-07T10:54:46.998Z  214536506         0
3        1  2014-04-07T10:57:00.306Z  214577561         0
4        2  2014-04-07T13:56:37.614Z  214662742         0
5        2  2014-04-07T13:57:19.373Z  214662742         0
6        2  2014-04-07T13:58:37.446Z  214825110         0
7        2  2014-04-07T13:59:50.710Z  214757390         0
8        2  2014-04-07T14:00:38.247Z  214757407         0
9        2  2014-04-07T14:02:36.889Z  214551617         0
10       3  2014-04-02T13:17:46.940Z  214716935         0
11       3  2014-04-02T13:26:02.515Z  214774687         0
12       3  2014-04-02T13:30:12.318Z  214832672         0

I referred this question for the code- Timestamp Conversion

Here is my code-

k.columns=['Sid','Tstamp','Itemid','Category']
k=k.loc[:,('Sid','Tstamp')]

#Find max timestamp
idx=k.groupby(['Sid'])['Tstamp'].transform(max) == k['Tstamp'] 
ah=k[idx].reset_index()

#Find min timestamp
idy=k.groupby(['Sid'])['Tstamp'].transform(min) == k['Tstamp']
ai=k[idy].reset_index()

#grouping by Sid and applying count to retain the distinct Sid values
kgrp=k.groupby('Sid').count()

i=0
for temp1,temp2 in zip(ah['Tstamp'],ai['Tstamp']):
    sv1= datetime.datetime.strptime(temp1, "%Y-%m-%dT%H:%M:%S.%fZ")
    sv2= datetime.datetime.strptime(temp2, "%Y-%m-%dT%H:%M:%S.%fZ")
    d1=time.mktime(sv1.timetuple()) + (sv1.microsecond / 1000000.0)
    d2=time.mktime(sv2.timetuple()) + (sv2.microsecond / 1000000.0)
    kgrp.loc[i,'duration']= d1-d2   
    i=i+1

Here is the output.

 
kgrp
Out[5]: 
      Tstamp  duration
Sid                   
1          4   359.275
2          6   745.378
3          3  1034.468

For session id 2, the duration should be close to 6 minutes however I'm getting almost 12 minutes. I reckon I'm making some silly mistake here.

Also, I'm grouping by Sid and applying count on it so as to get the Sid column and store each duration as a separate column. Is there any easier method through which I can store only the Sid (not the 'Tstamp' Count Column) and its duration values?

Community
  • 1
  • 1
ADITYA AWALKAR
  • 117
  • 1
  • 8

1 Answers1

1

You are assigning the duration value to the wrong label. In your test data sid starts from 1 but i starts from 0:

# for sid 1, i == 0
kgrp.loc[i,'duration']= d1-d2
i=i+1

Update

A more pythonic way to handle this :)

def calculate_duration(dt1, dt2):
    # do the calculation here, return the duration in seconds

k = k.loc[:, ('Sid', 'Tstamp')]
result = k.groupby(['Sid'])['Tstamp'].agg({
    'Duration': lambda x: calculate_duration(x.max(), x.min()),
    'Count': lambda x: x.count()
})
Shaung
  • 508
  • 4
  • 11
  • Thank you. I also found out that not all Sid are present. eg. Sid 5 is not in the data so in the 5th row, duration of Sid 6 gets stored. What can be done to solve this problem since I'm iterating i continuously without skipping any values? – ADITYA AWALKAR Feb 06 '16 at 19:44
  • I made the following changes in the code- 1) changed index to Sid 2) Removed iterator i 3) Added 3rd variable in for loop It works fine now. However unnecessary I had to add 3rd variable to access the index of the row when it's the same index as that of temp1 (because I'm making changes in ah only). So is there a way to directly get the index from temp1. Here is my current code- **for temp1,temp2,temp3 in zip(ah['Tstamp'],ah.index,ai['Tstamp']): ah.loc[temp2,'duration']= d1-d2** – ADITYA AWALKAR Feb 07 '16 at 13:08
  • @ADITYAAWALKAR I've updated my answer. aggregation method should works for this situation. – Shaung Feb 08 '16 at 05:12