
I have the following problem: I have data from a quiz application (a table called 'answers') containing the answered questions per user with the respective answering date (one answer per line), e.g.:

UserID  Time                 Term      QuestionID  Answer
1       2019-12-28 18:25:15  Winter19  345         a
2       2019-12-29 20:15:13  Winter19  734         b
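A minimal sketch of getting such a table into pandas (the file name answers.csv is hypothetical; the real source may differ). Lower-casing the column names matches the answers.userid / answers.time style used in the code below:

import pandas as pd

#hypothetical loading step; parse_dates makes the .dt accessors on 'time' work
answers = pd.read_csv('answers.csv', parse_dates=['Time'])
answers.columns = answers.columns.str.lower() #userid, time, term, questionid, answer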

I would like to write an algorithm to determine whether a user has used the quiz application several days in a row (a so-called 'streak'). To that end, I want to create a table ('appData') with the following information:

UserID  Term      HighestStreak
1       Winter19  7
2       Winter19  10

For this table I need to compute the variable 'HighestStreak'. I managed to do so with the following code:

for userid, term in zip(appData.userid, appData.term):
    final_streak = 1
    for i in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
        temp_streak = 1
        while i + pd.DateOffset(days=1) in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
            i += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak

Unfortunately, running this code takes about 45 minutes. The table 'answers' has about 4,000 lines. Is there any structural 'mistake' in my code that makes it so slow, or do computations like this simply take that long?

Any help would be highly appreciated!

EDIT:

I managed to reduce the running time from 45 minutes to 2 minutes with the following changes:

First, I filtered the data to students who answered at least one question and set the streak to 0 for everyone else (the streak for 0 answers is always 0):

appData.loc[appData.totAnswers==0, 'HighestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]

Furthermore, I moved the filtered list of active days out of the inner loop, so the algorithm does not need to filter the same data over and over, resulting in the following new code:

appData.loc[appData.totAnswers==0, 'HighestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
for userid, term in zip(appDataActive.userid, appDataActive.term):
    activeDays = answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique()
    final_streak = 1
    for day in activeDays:
        temp_streak = 1
        while day + pd.DateOffset(days=1) in activeDays:
            day += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak

Of course, 2 minutes is much better than 45 minutes. But are there any more tips?

mrvideo
  • Well, I'm not familiar with the syntax here, but it looks like on lines 3 and 5 you may be doing the same search for `answers[(answers.userid==userid) & (answers.term==term)]`. Maybe that could be done once? – xdhmoore Feb 18 '21 at 20:11
  • You're taking the wrong approach by looping through the data, essentially three times, making your solution O(n^3). You just happened to pick an inefficient solution. What you should be doing is grouping the data by user id and sorting by time asc. Make sure that time is coerced to a date, not a timestamp. Then you join this table to itself on both user id and time = time + 1. This should give you enough direction to get started. – gold_cy Feb 18 '21 at 20:11
  • @xdhmoore thank you for your advice. This led me to the new code I added in my EDIT above. Do you have any more tips? – mrvideo Feb 20 '21 at 21:02
  • @gold_cy Do you think that your solution would lead to a better performance than my improved code above? – mrvideo Feb 20 '21 at 21:03
  • I don't really understand the data well enough to make more suggestions, other than the adage "premature optimization is the root of all evil". In other words, once you've fixed obvious performance issues, it's good to try a profiler to see where the time is being taken up. Sometimes people spend a lot of time making performance "improvements" that are really just guesses in the dark. So at some point after obvious fixes that might be the way to go if you really want to squeeze it down as fast as you can get it. – xdhmoore Feb 20 '21 at 21:24
  • It might, but I echo @xdhmoore's comments about not knowing enough about the data to make more suggestions. Basically what you're looking for is a window function in pandas. – gold_cy Feb 20 '21 at 21:37
  • @xdhmoore Thank you for your comment. I am sorry, but I am new to this topic. What is a 'profiler' exactly and how can I use it? – mrvideo Feb 21 '21 at 08:40
  • @gold_cy What kind of details exactly do you need to be able to make more suggestions? The data types? I just had a quick look at the window function and it sounds promising. Thank you very much. – mrvideo Feb 21 '21 at 08:41
  • Basically, a profiler is a program that runs your program and tells you how much time different functions are taking. See https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script?rq=1 The only one I've tried with Python is SnakeViz, which seemed friendly enough, though it was a while ago (a short cProfile sketch follows this comment thread). – xdhmoore Feb 21 '21 at 08:45
  • But idk. I think @gold_cy's suggestions about using built-in pandas stuff are probably more worth your time to try first. – xdhmoore Feb 21 '21 at 08:46
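A minimal sketch of the profiling workflow described above, using the standard-library cProfile and pstats modules (compute_streaks() is a hypothetical wrapper around the streak loop from the question):

import cProfile
import pstats

def compute_streaks():
    ... #hypothetical stand-in for the streak loop above

#record where the time goes, then show the 10 most expensive calls
cProfile.run('compute_streaks()', 'streaks.prof')
pstats.Stats('streaks.prof').sort_stats('cumulative').print_stats(10)

Alternatively, `python -m cProfile -s cumulative your_script.py` profiles a whole script without any code changes, and SnakeViz can visualize the saved .prof file.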

1 Answer


My attempt borrows some key ideas from the connected-components problem, one of the first problems encountered when studying graphs.

First, I create a random DataFrame with some user IDs and some dates.

import datetime
import random

import pandas
import numpy

#generate basic dataframe of users and answer dates
def sample_data_frame():
    users = ['A' + str(x) for x in range(10000)] #generate user id
    date_range = pandas.Series(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364) , datetime.date.today()),
                           name='date')
    users = pandas.Series(users, name='user')
    df = pandas.merge(date_range, users, how='cross')
    removals = numpy.random.randint(0, len(df), int(len(df)/4)) #remove random quarter of entries
    df.drop(removals, inplace=True)
    return df

def sample_data_frame_v2(): #for pandas versions <1.2, where how='cross' is unavailable
    users = ['A' + str(x) for x in range(10000)] #generate user id
    date_range = pandas.DataFrame(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364) , datetime.date.today()), columns = ['date'])
    users = pandas.DataFrame(users, columns = ['user'])
    date_range['key'] = 1
    users['key'] = 1
    df = users.merge(date_range, on='key')
    df = df.drop(labels='key', axis=1) #remove the helper key column
    removals = numpy.random.randint(0, len(df), int(len(df)/4)) #remove random quarter of entries
    df.drop(removals, inplace=True)
    return df
  • sort the DataFrame by user and then by date, so that consecutive rows hold consecutive answer days of the same user
  • create two new columns containing the user id and the date of the previous row (via shift)
  • if the previous row belongs to the same user and the previous date plus one day equals the current date, the streak continues, so set the result column to False (numerically 0); otherwise the row starts a new streak, so set it to True (numerically 1)
  • cumulatively sum the result column, which assigns a distinct group number to every streak
  • finally, count how many entries exist per group and take the maximum count for each user

For 10k users with 364 days' worth of answers, my running time is about 1 second:

df = sample_data_frame()
df = df.sort_values(by=['user', 'date']).reset_index(drop = True)
df['shift_date'] = df['date'].shift()
df['shift_user'] = df['user'].shift()

df['result'] = ~((df['shift_date'] == df['date'] - datetime.timedelta(days=1)) & (df['shift_user'] == df['user']))
df['group'] = df['result'].cumsum()

summary = (df.groupby(by=['user', 'group']).count()['result'].max(level='user'))

print(summary.sort_values(ascending=False)) #show users with the highest streaks first
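To sanity-check the logic, the same shift/cumsum steps can be run on a tiny hand-made frame where the expected streaks are easy to verify by eye (the toy users and dates below are made up for illustration):

import datetime
import pandas

#user A1 answers on Jan 1-3 and Jan 5-6 (streaks of 3 and 2); user A2 only on Jan 4 (streak of 1)
check = pandas.DataFrame({
    'user': ['A1'] * 5 + ['A2'],
    'date': pandas.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03',
                                '2021-01-05', '2021-01-06', '2021-01-04']),
})
check = check.sort_values(by=['user', 'date']).reset_index(drop=True)
check['result'] = ~((check['date'].shift() == check['date'] - datetime.timedelta(days=1))
                    & (check['user'].shift() == check['user']))
check['group'] = check['result'].cumsum()

#groupby(level='user').max() is the same aggregation as .max(level='user') above
print(check.groupby(by=['user', 'group']).count()['result'].groupby(level='user').max())
#expected: A1 -> 3, A2 -> 1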
el_oso
  • Thank you very much for your effort. Unfortunately I receive the following error when I try to run your code on my machine: `MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False` What am I doing wrong? – mrvideo Mar 03 '21 at 08:08
  • Simple fix: update to the latest version of pandas; 'cross' is a fairly recent addition to the merge method. Alternatively, you can create a new column with a common key to join on; in my example I've created a column 'key' (see sample_data_frame_v2 above). – el_oso Mar 03 '21 at 18:41