
This is an example of my dataset, which is about online gaming. We have the session id qualifying the bet, the date when the bet occurred and the result of the bet (win-draw-lose):

e = {'session': ['1', '3', '1', '1', '3', '1', '2', '2', '1', '3',  '3', '3', '3', '3',  '2', '3', '3'],
    'date': ['2018-01-01 00:17:05', '2018-01-01 00:30:35', '2018-01-01 00:19:05', '2018-01-03 00:55:22',
             '2018-01-01 00:21:07', '2018-01-01 00:22:09', '2018-02-01 00:35:22', 
             '2018-01-01 00:22:17',  '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
              '2018-01-01 00:25:09',   '2018-01-01 00:17:01', '2018-02-01 00:31:16',  
             '2018-02-01 00:38:20', '2018-02-01 00:55:15',  '2018-02-01 00:38:16'], 
    'status': ['win', 'loss', 'loss', 'draw', 'loss', 'win', 'loss', 'loss', 'win', 'draw', 'loss', 'loss', 'loss', 
               'win', 'draw', 'loss', 'loss']}

#create dataframe
df2 = pd.DataFrame(data=e)
#sort it by session and date
df2 = df2.sort_values(['session', 'date']).reset_index(drop=True)
df2.head()

 session      date           status
0   1   2018-01-01 00:17:05   win
1   1   2018-01-01 00:19:05   loss
2   1   2018-01-01 00:22:09   win
3   1   2018-01-01 00:25:11   win
4   1   2018-01-03 00:55:22   draw

My objective is to count, for each session, the maximum number of consecutive plays that occurred without a win. This is what I did, following the advice from this SO post:

1. First, I created a column where win = 1 and the other values = 0:

m = {'win':1, 'loss':0, 'draw':0}
df2['status_num'] = df2.status.map(m)

session      date            status   status_num
0   1   2018-01-01 00:17:05  win        1
1   1   2018-01-01 00:19:05  loss       0
2   1   2018-01-01 00:22:09  win        1
3   1   2018-01-01 00:25:11  win        1
4   1   2018-01-03 00:55:22  draw       0

2. For each session, I compute the number of plays since the last win, and then append the result to a new dataframe:

#create list of sessions
plist = list(df2.session.unique())

final = pd.DataFrame()
for i in plist:
    #slice the dataset by session
    sess = df2.loc[df2['session'] == i]
    #calculate the last win occurrence
    sess['Last_win']= sess.groupby(sess.status_num.cumsum()).cumcount()
    #append the result
    final = final.append(sess)

final
session date                status status_num   Last_win
0   1   2018-01-01 00:17:05  win        1         0
1   1   2018-01-01 00:19:05  loss       0         1
2   1   2018-01-01 00:22:09  win        1         0
3   1   2018-01-01 00:25:11  win        1         0
4   1   2018-01-03 00:55:22  draw       0         1
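To illustrate what the `groupby(...cumsum()).cumcount()` trick does, here is a minimal standalone check on a single session (the 1/0 series plays the role of `status_num`):

```python
import pandas as pd

# 1 marks a win; each win starts a new cumsum group,
# and cumcount numbers the rows inside each group from 0
s = pd.Series([1, 0, 0, 1, 0])
print(s.groupby(s.cumsum()).cumcount().tolist())  # -> [0, 1, 2, 0, 1]
```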

3. Finally, I group by session to get the maximum sequence without a winning event:

last_win = final.groupby('session')['Last_win'].max().reset_index()
last_win
session Last_win
0   1     1
1   2     2
2   3     5

The code does what I need, BUT it is not very performant, and since I have a large dataset I would like to find a solution with a better running time. I am quite sure the bottleneck is the for loop, and the fact that each iteration does a groupby, but I really cannot think of another way. I also tried the approach suggested here, but I am not interested in days.

1 Answer


First, I would define a function to calculate the count for one session, like so. Note: this assumes the dataframe is sorted the way you sorted it.

def events_to_last_win(df):
    # count events since the most recent win; a win resets the counter
    vec = []
    counter = 0
    for i in df.index:
        if df.loc[i, 'status'] == 'win':
            counter = 0
        else:
            counter += 1
        vec.append(counter)
    return vec
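As a quick sanity check on a single session, the counter should climb after each non-win and reset to 0 on every win (a minimal sketch; the helper is copied here so the snippet runs standalone):

```python
import pandas as pd

# copy of the answer's helper so this snippet runs on its own
def events_to_last_win(df):
    vec = []
    counter = 0
    for i in df.index:
        if df.loc[i, 'status'] == 'win':
            counter = 0
        else:
            counter += 1
        vec.append(counter)
    return vec

small = pd.DataFrame({'status': ['win', 'loss', 'loss', 'win']})
print(events_to_last_win(small))  # -> [0, 1, 2, 0]
```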

Then do the initial set up:

import pandas as pd
e = {'session': ['1', '3', '1', '1', '3', '1', '2', '2', '1', '3',  '3', '3', '3', '3',  '2', '3', '3'],
    'date': ['2018-01-01 00:17:05', '2018-01-01 00:30:35', '2018-01-01 00:19:05', '2018-01-03 00:55:22',
             '2018-01-01 00:21:07', '2018-01-01 00:22:09', '2018-02-01 00:35:22', 
             '2018-01-01 00:22:17',  '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
              '2018-01-01 00:25:09',   '2018-01-01 00:17:01', '2018-02-01 00:31:16',  
             '2018-02-01 00:38:20', '2018-02-01 00:55:15',  '2018-02-01 00:38:16'], 
    'status': ['win', 'loss', 'loss', 'draw', 'loss', 'win', 'loss', 'loss', 'win', 'draw', 'loss', 'loss', 'loss', 
               'win', 'draw', 'loss', 'loss']}


df = pd.DataFrame(data=e)
#sort it by session and date
df = df.sort_values(['session', 'date']).reset_index(drop=True)

Then we can assign the Last_win column a dummy value and update it by session id using the function we defined:

df['Last_win'] = 0
for session in df.session.unique():
    df.loc[df.session == session, 'Last_win'] = events_to_last_win(df[df.session == session])

df.groupby('session')['Last_win'].max().reset_index()

The average run time of this method over 1000 runs on my machine is 0.0153894310000004 seconds.

Compared to the average over 1000 runs of the method posted in the question: 0.19408468899999962 seconds.

Note that I haven't verified the values this approach produces, but it should outline a much faster way to solve the problem.
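For reference, the per-session loop can be avoided entirely with grouped cumulative operations: wins start a new "block" within each session, and numbering the rows inside each block gives the events-since-last-win counter. This is a hedged sketch on a toy frame, not verified against the full dataset:

```python
import pandas as pd

# toy data standing in for the question's (already sorted) dataframe
df = pd.DataFrame({'session': ['1', '1', '1', '2', '2', '2'],
                   'status':  ['win', 'loss', 'loss', 'loss', 'win', 'loss']})

# each win starts a new block within its session
is_win = df['status'].eq('win')
block = is_win.groupby(df['session']).cumsum()
# number the rows inside each (session, block) pair: events since the last win
since_win = df.groupby([df['session'], block]).cumcount()
print(since_win.groupby(df['session']).max().to_dict())  # -> {'1': 2, '2': 1}
```

No Python-level loop runs per session here, so the cost scales with a handful of vectorised passes over the frame rather than with the number of sessions.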
