
I have a dataframe with the following structure at the moment.

[screenshot: current dataframe]

I saw this post here, in which the second answer says that using NumPy arrays is the best way to loop over a huge dataframe.

This is my requirement:

  1. Loop through unique dates.
  2. Within each unique date in the dataframe, loop through unique sessions.
  3. Once I'm inside a unique session within a unique date, I need to do some operations.

Currently I'm using a for loop, but it's unbearably slow. Can anyone suggest how to use NumPy arrays to meet my requirements, as suggested in that post?

EDIT:

I'm elaborating my requirement here:

  1. Loop through unique dates, which would give me the following dataframe:
     [screenshot: unique days]
  2. Within each unique date, loop through unique sessionIds, which would give me something like this:
     [screenshot: unique sessions]
  3. Once within a unique sessionId within a unique date, find the timestamp difference between the last element and the first element. This time difference is added to a list, one entry per unique session.
  4. Outside the second loop, I take the average of the list created in the step above.
  5. The value we get in step 4 is added to another list.

The aim is to find the average time difference between the last and first message of each session, per day.
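The steps above can be sketched as an explicit double loop. This is only a sketch of the requirement as stated: the column names `ChatDate`, `sessionId`, and `timestamp` are assumptions based on the screenshots, and the sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the screenshot (assumed column names)
df = pd.DataFrame({
    "ChatDate": ["2018-09-01", "2018-09-01", "2018-09-01", "2018-09-02"],
    "sessionId": ["s1", "s1", "s2", "s3"],
    "timestamp": pd.to_datetime([
        "2018-09-01 10:00", "2018-09-01 10:05",
        "2018-09-01 11:00", "2018-09-02 09:00",
    ]),
})

daily_averages = []                               # step 5: one entry per date
for date in df["ChatDate"].unique():              # step 1: loop over unique dates
    day = df[df["ChatDate"] == date]
    diffs = []                                    # step 3: per-session time differences
    for session in day["sessionId"].unique():     # step 2: loop over unique sessions
        s = day[day["sessionId"] == session]
        # last minus first timestamp within this session
        diffs.append(s["timestamp"].iloc[-1] - s["timestamp"].iloc[0])
    daily_averages.append(pd.Series(diffs).mean())  # step 4: daily mean
```

This is the slow, loop-based version the question describes; the answer below replaces both loops with a single `groupby`.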

Tony Mathew
    Your possibilities depend on the operations you need to do in step 3. NumPy arrays would help if you are doing some matrix operations; otherwise it should not matter whether you are using NumPy arrays or some other kind of variables. You should specify/bring out more details of your problem. Loops in Python can be really slow, and I have sometimes used a combination of Python and Fortran (f2py) or Python and Java to speed up the program. – msi_gerva Sep 24 '18 at 11:35
  • 2
    df.groupby(['ChatDate', 'sessionId']).apply(lambda x: some_operations(x)) ? – Charles R Sep 24 '18 at 11:35
  • We can help you better if you give us at least a contrived example of what you actually want to do with the data, including desired output for the example input you gave. – John Zwinck Sep 24 '18 at 11:46
  • You can also try using multiprocessing https://docs.python.org/2/library/multiprocessing.html – Shubham R Sep 24 '18 at 11:49
  • @msi_gerva, I have updated the requirements as per your suggestion. Can you please check now and let me know if you have a solution? I have tried itertuples() instead of for loop as well. But even that is very slow – Tony Mathew Sep 24 '18 at 12:52
  • @JohnZwinck, I have updated the question with more elaborate requirement. Can you please have a look at it now? – Tony Mathew Sep 24 '18 at 12:53
  • I do not know your data, but isn't the session ID already a unique string that is associated with only one particular session, i.e. can you skip the loop over the days? If so, I suggest sorting the data by session ID and then by date and time. Afterwards you should be able to go through your data only once to detect/collect the times. – msi_gerva Sep 24 '18 at 13:58
  • @msi_gerva nope, sessionId is not unique. If you see the first screenshot attached, there are multiple rows with the same sessionId. Each row represents a message. And all the rows with same sessionId are the messages sent in that session. I hope I have made it clear for you – Tony Mathew Sep 24 '18 at 18:06
  • @JohnZwinck your solution worked! Thanks mate :) You are a genius! – Tony Mathew Sep 24 '18 at 19:52

1 Answer


Use groupby:

grouped = df.groupby(['ChatDate', 'sessionId'])
timediff = grouped.timestamp.last() - grouped.timestamp.first()  # or max - min
timediff.mean()  # this is your step 4
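As a usage sketch of this answer (column names and sample data are assumptions based on the question's screenshots): since `timediff` is indexed by `(ChatDate, sessionId)`, the per-day averages of steps 4-5 can be obtained by grouping the result again by the date level, while `timediff.mean()` gives the overall average across all sessions.

```python
import pandas as pd

# Hypothetical sample data (assumed column names from the question)
df = pd.DataFrame({
    "ChatDate": ["2018-09-01", "2018-09-01", "2018-09-01", "2018-09-02"],
    "sessionId": ["s1", "s1", "s2", "s3"],
    "timestamp": pd.to_datetime([
        "2018-09-01 10:00", "2018-09-01 10:05",
        "2018-09-01 11:00", "2018-09-02 09:00",
    ]),
})

grouped = df.groupby(["ChatDate", "sessionId"])
timediff = grouped["timestamp"].last() - grouped["timestamp"].first()

# one average per day, matching steps 4-5 of the question
per_day = timediff.groupby(level="ChatDate").mean()
```

Here session `s1` spans 5 minutes and `s2` spans 0, so the 2018-09-01 average is 2.5 minutes; single-message sessions contribute a zero difference.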
John Zwinck