1

I'm trying to calculate the time between a username's status changes from Original Install/ Reinstall to Uninstall (and vice versa) on an app. I've been programming in Python for around a year or so, so my functions are sometimes slower or not as optimal; I'm having a similar issue with this task.

I used the following loop to calculate the time difference between states. The major issue is that this function is horrendously slow (for the 700k rows of data I'm working with), and it doesn't return the all of the time intervals.

# Import modules
import pandas as pd
import numpy as np

# Load sample data
data = pd.read_csv('username_status_dates.csv').drop(columns='index')
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%M-%d')

# Create dataframe to record time intervals
customerStatuses_df = pd.DataFrame(columns=['username', 'status1', 'status2', 'timeToUninstall', 'timeToReinstall'])

# Iterate over data, calculate time intervals (by username), and add to customerStatuses
for user in data:

    # Get user sub data
    user_df = data.loc[ customerStatuses_df['Username'] == user ]
    dates = list(user_df['Date'])
    status = list(user_df['Status'])

    for val in np.arange(0, len(status) - 1):

        # If status in installed (original or re)
        if (status[val] == 'Original Install') or (status[val] == 'Reinstall'):

            # If next status is uninstall (un)
            if status[val + 1] == 'Uninstall':

                # Calculate time interval between statuses
                d_unin = dates[val + 1] - dates[val]

                # Add to customerStatuses
                customerStatuses_df = customerStatuses_df.append({'username': user, 'status1': status[val], 'status2': status[val+1], 'timeToUninstall': d_unin}, ignore_index=True)

        # If status in not installed (un)      
        elif (status[val] == 'Uninstall'):

            # If next status is installed (re)
            if status[val + 1] == 'Reinstall':

                # Calculate time interval between statuses
                d_rein = dates[val + 1] - dates[val]

                # Add to customerStatuses
                 customerStatuses_df = customerStatuses_df.append({'username': user, 'status1': status[val], 'status2': status[val+1], 'timeToReinstall': d_rein}, ignore_index=True)

sample data: username_status_dates.csv

   Username         Status                  Date
0   Joey74      Original Install    2019-01-20 00:09:00
1   Sam22       Original Install    2019-01-15 00:10:00
2   Chloe35     Uninstall           2019-01-30 00:10:00
3   Sam22       Uninstall           2019-01-01 00:11:00
4   Zoe44       Reinstall           2019-01-10 00:12:00
5   Joey74      Reinstall           2019-01-20 00:12:00
6   Shelby99    Original Install    2020-01-07 00:01:00
7   Joey74      Uninstall           2020-01-05 00:01:00
8   Joey74      Reinstall           2020-01-30 00:01:00
9   Susane11    Reinstall           2020-01-04 00:02:00

result from function: customerStatuses_df

username    status1 status2 timeToUninstall timeToReinstall
0   Joey74  Reinstall   Uninstall   349 days 23:49:00   NaN
1   Joey74  Uninstall   Reinstall   NaT 25 days 00:00:00
2   Sam22   Original Install    Uninstall   -14 days +00:01:00  NaN

What am I missing, here?

eyllanesc
  • 235,170
  • 19
  • 170
  • 241

1 Answers1

2

A colleague and I realized that the core issue with this loop is that it attempts to iterate over a dataframe, itself, which is disproportionately slow when compared to other methods of combing through dataframes. This partially explained why my function (above) was so slow when using a larger dataset. We also found that working with vectorized numbers is much faster than working with the Date column's timedelta-format.

enter image description here

We found a great StackOverflow source that shows how to convert (or implode) dataframes into dictionaries in order to replace the need to iterate over the dataframe itself. To answer the question above, I organized the data by username, vectorized the Date column, and converted the dataframe into a dictionary, then calculated the time between state changes:

# Sort by username, then date
customerStatuses_df = customerStatuses_df.sort_values(by=['Username', 'Date'])

# Vectorize the Date column with .diff()
vectorized_time = customerStatuses_df['Date'].diff()
customerStatuses_df['vectorizedTime'] = pd.Series(vectorized_time)

# Implode the dataframe by making it a dictionary
customerStatuses_dict = customerStatuses_df.to_dict('index')

# Create lists to form a final dataframe
username = []
status_1 = []
status_2 = []
time_change = []

# Track timing of function
from tqdm import tqdm

# Determine difference between states, by user
imploded_dict = {}

for key, value in tqdm(list(customerStatuses_dict.items())):

    # Grab all install Statuses and vectorizedTime attributed to the Username 'key'
    try:
        # Python prefers 'ask forgiveness, not permission' -> attempt to append to list
        imploded_dict[value['Username']].append((value['Status'], value['vectorizedTime']))

    except KeyError:  # but if list does not exist, create the list
        imploded_dict[value['Username']] = [(value['Status'], value['vectorizedTime'])]

# Using the now-imploded data, calculate time change between states
for user_id, status_tuple_list in tqdm(list(imploded_dict.items())):

    # Iterate key value pairs in dictionary
    for idx, (status, time_delta) in enumerate(status_tuple_list): 
        if idx == 0:
            continue

        # Add to lists for final dataframe
        username.append(user_id)
        status_1.append(status_tuple_list[idx - 1][0])
        status_2.append(status_tuple_list[idx][0])
        time_change.append(time_delta)


# Create dataframe to hold time changes
results_df = pd.DataFrame(list(zip(username, status_1, status_2, time_change)), columns=['username', 'status1', 'status2', 'timeToChange'])

results_df

    username     status1             status2            timeToChange
0   Joey74      Original Install    Reinstall          0 days 00:03:00
1   Joey74      Reinstall           Uninstall          349 days 23:49:00
2   Joey74      Uninstall           Reinstall          25 days 00:00:00
3   Sam22       Uninstall           Original Install   13 days 23:59:00