What data type should I use in python to speed up iteration over a dataframe?

Question

I want to create a new dataframe using information from a given dataset. What I'm doing right now uses .iterrows(), and it's frustratingly slow. This is what I've got so far:

The original dataset (data) has two columns: user ID and a timestamp. I'm creating new dataframe (session_data) with three columns: user ID, session_start, and session duration.

#create empty dataframe
session_data = pd.DataFrame(columns=['ID', 'session_start', 'session_duration']) 

for index, row in data.iterrows():
    if row['ID'] in session_data.ID:
        # update the session duration 
    else:
        session = pd.DataFrame([[row['ID'], row['timestamp'], 0]], columns=['ID', 'session_start', 'session_duration'])
        session_data = session_data.append(session)

I'm thinking that instead of using a dataframe for session_data, I should create some sort of other object and use that to create a dataframe after I've iterated through the data. However as a noob I'm really struggling with what data type to use instead of the session_data dataframe, and whether I need to be using .iterrows() at all.

Any help is appreciated! Please let me know if I need to add more information.

EDIT: Here's some more information to create a reproducible example. To get data, I'm linking to an external .csv with 100,000 rows. For convenience, here's a sample dataframe:

data = pd.DataFrame({'ID': ['1234', '5678', '5678', '1234'], 
                   'timestamp': ['12/23/14 16:53', '12/23/14 16:50', '12/23/14 16:52', '12/23/14 17:20']})

I've created session_data in the above snippet like so:

#create empty dataframe
session_data = pd.DataFrame(columns=['ID', 'session_start', 'session_duration'])

In the end, I want session data to look something like this:

   user_id  session_start session_duration

0  1234    12/23/14 16:53  27 minutes
1  5678    12/23/14 16:50  2 minutes

A proper [mcve] would help us to understand your question better, it seems like you shouldn't need to use iterrows at all, but a [Good pandas sample](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) would make it more clear — G. Anderson, Aug 17 '20 at 23:05
Thanks @G.Anderson - I've updated my question above, let me know if I've addressed your comment and if there's anything I can do to make things more clear — ellen, Aug 17 '20 at 23:39

null · Accepted Answer · 2020-08-18T08:47:25.333

I must say this is a bit misuse of pandas. Pandas is designed in a way that it can speed things up when you work with vectorization. Meaning, iterating over rows or cols does not really help with pandas, you only use those if you desperately need them. Moreover, with builtin keyword like in there are methods in pandas to search faster like pd.Series.isin isin() method searchs for exact match of given input.

With update you just want the latest timestamp for a particular ID, at least it is what this code does.

You can easily achieve it by doing this.

import pandas as pd

data = pd.DataFrame({'ID': ['1234', '5678', '5678', '1234'], 
                   'timestamp': ['12/23/14 16:53', '12/23/14 16:50', '12/23/14 16:52', '12/23/14 17:20']})

data['timestamp'] = pd.to_datetime(data['timestamp'])

session_start_data = data.groupby('ID').head(1)
session_start_data.rename(columns = {'timestamp': 'session_start'}, inplace=True)

session_end_data = data.groupby('ID').tail(1)
session_end_data.rename(columns = {'timestamp': 'session_end'}, inplace=True)


session_data = session_start_data.merge(session_end_data, on='ID', how='inner')
session_data['session_duration'] = session_data['session_end'] - session_data['session_start']

Output

     ID       session_start         session_end session_duration
0  1234 2014-12-23 16:53:00 2014-12-23 17:20:00         00:27:00
1  5678 2014-12-23 16:50:00 2014-12-23 16:52:00         00:02:00

If you want, you can safely remove 'session_end' column

Thanks @null I've updated my question above, I don't only want the latest timestamp for a particular ID as some people will log in multiple times and I would like to capture different sessions for each user (they will time out after a certain # of minutes) — ellen, Aug 17 '20 at 23:41
@ellen I updated the answer with the proper way of using pandas. Hope it helps. — null, Aug 18 '20 at 13:29

What data type should I use in python to speed up iteration over a dataframe?

1 Answers1