I'm trying to calculate the time between a username's status changes from Original Install/ Reinstall to Uninstall (and vice versa) on an app. I've been programming in Python for around a year or so, so my functions are sometimes slower or not as optimal; I'm having a similar issue with this task.
I used the following loop to calculate the time difference between states. The major issue is that this function is horrendously slow (for the 700k rows of data I'm working with), and it doesn't return the all of the time intervals.
# Import modules
import pandas as pd
import numpy as np
# Load sample data
data = pd.read_csv('username_status_dates.csv').drop(columns='index')
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%M-%d')
# Create dataframe to record time intervals
customerStatuses_df = pd.DataFrame(columns=['username', 'status1', 'status2', 'timeToUninstall', 'timeToReinstall'])
# Iterate over data, calculate time intervals (by username), and add to customerStatuses
for user in data:
# Get user sub data
user_df = data.loc[ customerStatuses_df['Username'] == user ]
dates = list(user_df['Date'])
status = list(user_df['Status'])
for val in np.arange(0, len(status) - 1):
# If status in installed (original or re)
if (status[val] == 'Original Install') or (status[val] == 'Reinstall'):
# If next status is uninstall (un)
if status[val + 1] == 'Uninstall':
# Calculate time interval between statuses
d_unin = dates[val + 1] - dates[val]
# Add to customerStatuses
customerStatuses_df = customerStatuses_df.append({'username': user, 'status1': status[val], 'status2': status[val+1], 'timeToUninstall': d_unin}, ignore_index=True)
# If status in not installed (un)
elif (status[val] == 'Uninstall'):
# If next status is installed (re)
if status[val + 1] == 'Reinstall':
# Calculate time interval between statuses
d_rein = dates[val + 1] - dates[val]
# Add to customerStatuses
customerStatuses_df = customerStatuses_df.append({'username': user, 'status1': status[val], 'status2': status[val+1], 'timeToReinstall': d_rein}, ignore_index=True)
sample data: username_status_dates.csv
Username Status Date
0 Joey74 Original Install 2019-01-20 00:09:00
1 Sam22 Original Install 2019-01-15 00:10:00
2 Chloe35 Uninstall 2019-01-30 00:10:00
3 Sam22 Uninstall 2019-01-01 00:11:00
4 Zoe44 Reinstall 2019-01-10 00:12:00
5 Joey74 Reinstall 2019-01-20 00:12:00
6 Shelby99 Original Install 2020-01-07 00:01:00
7 Joey74 Uninstall 2020-01-05 00:01:00
8 Joey74 Reinstall 2020-01-30 00:01:00
9 Susane11 Reinstall 2020-01-04 00:02:00
result from function: customerStatuses_df
username status1 status2 timeToUninstall timeToReinstall
0 Joey74 Reinstall Uninstall 349 days 23:49:00 NaN
1 Joey74 Uninstall Reinstall NaT 25 days 00:00:00
2 Sam22 Original Install Uninstall -14 days +00:01:00 NaN
What am I missing, here?