I want to create a new dataframe using information from a given dataset. What I'm doing right now uses .iterrows(), and it's frustratingly slow. This is what I've got so far:
The original dataset (data) has two columns: user ID and a timestamp. I'm creating a new dataframe (session_data) with three columns: user ID, session_start, and session_duration.
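import pandas as pd

# create empty dataframe
session_data = pd.DataFrame(columns=['ID', 'session_start', 'session_duration'])

for index, row in data.iterrows():
    # compare against the column's values; `in` on a Series checks its index
    if row['ID'] in session_data['ID'].values:
        pass  # update the session duration (not written yet)
    else:
        session = pd.DataFrame([[row['ID'], row['timestamp'], 0]], columns=['ID', 'session_start', 'session_duration'])
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        session_data = pd.concat([session_data, session], ignore_index=True)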
I'm thinking that instead of using a dataframe for session_data, I should build some other kind of object while iterating and only create the dataframe after I've gone through the data. However, as a noob I'm really struggling with what data type to use instead of the session_data dataframe, and whether I need to be using .iterrows() at all.
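To make the idea concrete, here is a rough sketch of what I have in mind: keep a plain dict keyed by ID while looping, then build the dataframe once at the end. It assumes one session per user and that the duration is the gap between a user's first and last timestamp. The `sessions` dict and the date format string are my own guesses, so treat this as an illustration rather than a known-good approach.

```python
import pandas as pd

# Small sample with the same shape as my data: an ID and a timestamp string.
data = pd.DataFrame({'ID': ['1234', '5678', '5678', '1234'],
                     'timestamp': ['12/23/14 16:53', '12/23/14 16:50',
                                   '12/23/14 16:52', '12/23/14 17:20']})

# Parse once up front so comparisons happen on datetimes, not strings.
# The format string is an assumption based on the sample values.
timestamps = pd.to_datetime(data['timestamp'], format='%m/%d/%y %H:%M')

# Hypothetical helper structure: ID -> (first_seen, last_seen)
sessions = {}
for user_id, ts in zip(data['ID'], timestamps):
    if user_id in sessions:
        first, last = sessions[user_id]
        sessions[user_id] = (min(first, ts), max(last, ts))
    else:
        sessions[user_id] = (ts, ts)

# Build the dataframe a single time instead of growing it row by row.
session_data = pd.DataFrame(
    [(uid, first, last - first) for uid, (first, last) in sessions.items()],
    columns=['ID', 'session_start', 'session_duration'],
)
```

Dict lookups are constant time, so this avoids both the repeated membership scan and the repeated dataframe copies in my loop above.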
Any help is appreciated! Please let me know if I need to add more information.
EDIT: Here's some more information to create a reproducible example.
To get data, I'm linking to an external .csv with 100,000 rows. For convenience, here's a sample dataframe:
data = pd.DataFrame({'ID': ['1234', '5678', '5678', '1234'],
                     'timestamp': ['12/23/14 16:53', '12/23/14 16:50', '12/23/14 16:52', '12/23/14 17:20']})
I've created session_data in the above snippet like so:
#create empty dataframe
session_data = pd.DataFrame(columns=['ID', 'session_start', 'session_duration'])
In the end, I want session_data to look something like this:
   user_id   session_start  session_duration
0     1234  12/23/14 16:53        27 minutes
1     5678  12/23/14 16:50         2 minutes
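If the loop turns out not to be needed at all, I imagine something along these lines with groupby might produce that table. This is only a sketch, assuming one session per user and a duration equal to the last timestamp minus the first. The intermediate name `bounds` is made up for illustration, and the duration comes out as a timedelta rather than the text "27 minutes".

```python
import pandas as pd

data = pd.DataFrame({'ID': ['1234', '5678', '5678', '1234'],
                     'timestamp': ['12/23/14 16:53', '12/23/14 16:50',
                                   '12/23/14 16:52', '12/23/14 17:20']})

# Convert the strings to datetimes so min/max and subtraction are meaningful.
# The format string is an assumption based on the sample values.
data['timestamp'] = pd.to_datetime(data['timestamp'], format='%m/%d/%y %H:%M')

# First and last timestamp per user, computed without an explicit Python loop.
bounds = data.groupby('ID')['timestamp'].agg(['min', 'max']).reset_index()
bounds['session_duration'] = bounds['max'] - bounds['min']

session_data = bounds.rename(columns={'min': 'session_start'})[
    ['ID', 'session_start', 'session_duration']
]
print(session_data)
```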