This is an example of my dataset, which is about online gaming. We have the session id qualifying the bet, the date when the bet occurred and the result of the bet (win-draw-lose):
e = {'session': ['1', '3', '1', '1', '3', '1', '2', '2', '1', '3', '3', '3', '3', '3', '2', '3', '3'],
'date': ['2018-01-01 00:17:05', '2018-01-01 00:30:35', '2018-01-01 00:19:05', '2018-01-03 00:55:22',
'2018-01-01 00:21:07', '2018-01-01 00:22:09', '2018-02-01 00:35:22',
'2018-01-01 00:22:17', '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
'2018-01-01 00:25:09', '2018-01-01 00:17:01', '2018-02-01 00:31:16',
'2018-02-01 00:38:20', '2018-02-01 00:55:15', '2018-02-01 00:38:16'],
'status': ['win', 'loss', 'loss', 'draw', 'loss', 'win', 'loss', 'loss', 'win', 'draw', 'loss', 'loss', 'loss',
'win', 'draw', 'loss', 'loss']}
#create dataframe
df2 = pd.DataFrame(data=e)
#sort it by session and date
df2 = df2.sort_values(['session', 'date']).reset_index(drop=True)
df.head()
session date status
0 1 2018-01-01 00:17:05 win
1 1 2018-01-01 00:19:05 loss
2 1 2018-01-01 00:22:09 win
3 1 2018-01-01 00:25:11 win
4 1 2018-01-03 00:55:22 draw
My objective is to count for each session the maximum number of plays occurred without winning. This is what I did, following the advice from this SO post:
1.First, I created a column where win = 1, and the other values=0
m = {'win':1, 'loss':0, 'draw':0}
df2['status_num'] = df2.status.map(m)
session date status status_num
0 1 2018-01-01 00:17:05 win 1
1 1 2018-01-01 00:19:05 loss 0
2 1 2018-01-01 00:22:09 win 1
3 1 2018-01-01 00:25:11 win 1
4 1 2018-01-03 00:55:22 draw 0
2.For each session, I compute the days since the last win, and then append the result to a new dataframe:
#create list of sessions
plist = list(df2.session.unique())
final = pd.DataFrame()
for i in plist:
#slice the dataset by session
sess = df2.loc[df2['session'] == i]
#calculate the last win occurrence
sess['Last_win']= sess.groupby(sess.status_num.cumsum()).cumcount()
#append the result
final = final.append(sess)
final
session date status status_num Last_win
0 1 2018-01-01 00:17:05 win 1 0
1 1 2018-01-01 00:19:05 loss 0 1
2 1 2018-01-01 00:22:09 win 1 0
3 1 2018-01-01 00:25:11 win 1 0
4 1 2018-01-03 00:55:22 draw 0 1
3.Finally, I groupby to get the max sequence without any winning event per session:
last_win = final.groupby('session')['Last_win'].max().reset_index()
last_win
session Last_win
0 1 1
1 2 2
2 3 5
The code does what I needed, BUT it is not very performant, and as I have a large dataset I would like to find a better solution in terms of running time. i am quite sure that the bottleneck is the for loop, and the fact that for each iteration I do a groupby, but really I cannot think of another way. I also tried the approach suggested here but I am not interested in days.