This is a tricky question where I'm looking to improve code performance. Imagine a dataframe like this:
TOUR_ID ID PAGE_ID CREATED DATE AVAILABILITY
T_1 ID1 P1 2018-07-03 19:10:19 AVAILABLE
T_1 ID1 P1 2018-07-03 19:10:20 AVAILABLE
T_1 ID1 P2 2018-07-03 19:12:33 AVAILABLE
T_1 ID2 P3 2018-07-03 19:13:34 AVAILABLE
T_1 ID2 P3 2018-07-03 19:13:35 NOT AVAILABLE
T_1 ID2 P4 2018-07-03 19:16:24 AVAILABLE
T_2 ID3 P4 2018-07-03 18:23:19 AVAILABLE
T_2 ID3 P4 2018-07-03 18:23:20 NOT AVAILABLE
T_2 ID1 P1 2018-07-03 19:10:21 NOT AVAILABLE
T_2 ID2 P3 2018-07-03 19:13:37 NOT AVAILABLE
T_2 ID2 P3 2018-07-03 19:13:38 NOT AVAILABLE
T_2 ID3 P5 2018-07-03 20:56:33 AVAILABLE
T_2 ID3 P5 2018-07-03 20:56:34 NOT AVAILABLE
T_2 ID3 P5 2018-07-03 22:56:35 AVAILABLE
T_2 ID3 P6 2018-07-03 22:57:20 NOT AVAILABLE
T_2 ID3 P7 2018-07-03 22:58:35 AVAILABLE
T_2 ID4 P8 2018-07-03 22:59:00 AVAILABLE
T_2 ID1 P1 2018-07-03 23:12:00 AVAILABLE
T_2 ID1 P3 2018-07-03 23:32:00 AVAILABLE
On each group (Tour_ID, ID, Page_ID) I need to create a column with the last value of the previous group. Additionally, in the first time tour_ID, or ID, changes, I shall get NaNs, cuz there are not any previous values for that combination.
The outcome should look like this:
TOUR_ID ID PAGE_ID CREATED DATE AVAILABILITY PREVIOUS AVAILABILITY
T_1 ID1 P1 2018-07-03 19:10:19 AVAILABLE NaN
T_1 ID1 P1 2018-07-03 19:10:20 AVAILABLE NaN
T_1 ID1 P2 2018-07-03 19:12:33 AVAILABLE AVAILABLE
T_1 ID2 P3 2018-07-03 19:13:34 AVAILABLE NaN
T_1 ID2 P3 2018-07-03 19:13:35 NOT_AVAILABLE NaN
T_1 ID2 P4 2018-07-03 19:16:24 AVAILABLE NOT_AVAILABLE
T_2 ID3 P4 2018-07-03 18:23:19 AVAILABLE NaN
T_2 ID3 P4 2018-07-03 18:23:20 NOT AVAILABLE NaN
T_2 ID1 P1 2018-07-03 19:10:21 NOT AVAILABLE NaN
T_2 ID2 P3 2018-07-03 19:13:37 NOT AVAILABLE NaN
T_2 ID2 P3 2018-07-03 19:13:38 NOT AVAILABLE NaN
T_2 ID3 P5 2018-07-03 20:56:33 AVAILABLE NOT AVAILABLE
T_2 ID3 P5 2018-07-03 20:56:34 NOT AVAILABLE NOT AVAILABLE
T_2 ID3 P5 2018-07-03 22:56:35 AVAILABLE NOT AVAILABLE
T_2 ID3 P6 2018-07-03 22:57:20 NOT AVAILABLE AVAILABLE
T_2 ID3 P7 2018-07-03 22:58:35 AVAILABLE NOT AVAILABLE
T_2 ID4 P8 2018-07-03 22:59:00 AVAILABLE NaN
T_2 ID1 P1 2018-07-03 23:12:00 AVAILABLE NaN
T_2 ID1 P3 2018-07-03 23:32:00 AVAILABLE AVAILABLE
I have a code that runs but it does not scale well (the dataframe has about 900,000rows). Any help in improving code performance would be duly appreciated.
Here's what I have so far:
for current_op in df.TOUR_ID.unique():
dummy = df[df.TOUR_ID == current_op].ID.unique()
for current_ID in dummy:
dummy_m = df[(df.TOUR_ID == current_op) & (df.ID == current_ID)].PAGE_ID.unique()
for current_page in dummy_m:
mask = (df.TOUR_ID == current_op) & (df.ID == current_ID) & (df.PAGE_ID == current_page)
indexes = mask.reset_index().rename(columns ={0:'Bool'})
ind = indexes.index[indexes['Bool'] == True].tolist()[0]
if (ind == 0) | ((current_page == dummy_m[0])):
df.loc[mask,'Previous_availability'] = np.nan
else:
previous_aval = df.AVAILABILITY.loc[indexes['index'].loc[ind-1]]
df.loc[mask, 'Previous_availability'] = previous_aval
Note: The NaN's will be eventually dropped
-- Edit
Below is the code to create the dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([['T_1','ID1','P1','2018-07-03 19:10:19', 'AVAILABLE'],
['T_1','ID1','P1','2018-07-03 19:10:20', 'AVAILABLE'],
['T_1','ID1','P2','2018-07-03 19:12:33', 'AVAILABLE'],
['T_1','ID2','P3','2018-07-03 19:13:34', 'AVAILABLE'],
['T_1','ID2','P3','2018-07-03 19:13:35', 'NOT AVAILABLE'],
['T_1','ID2','P4','2018-07-03 19:16:24', 'AVAILABLE'],
['T_2','ID3','P4','2018-07-03 18:23:19', 'AVAILABLE'],
['T_2','ID3','P4','2018-07-03 18:23:20', 'NOT AVAILABLE'],
['T_2','ID1','P1','2018-07-03 19:10:21', 'NOT AVAILABLE'],
['T_2','ID2','P3','2018-07-03 19:13:36', 'NOT AVAILABLE'],
['T_2','ID2','P3','2018-07-03 19:13:37', 'NOT AVAILABLE'],
['T_2','ID3','P5','2018-07-03 20:56:33', 'AVAILABLE'],
['T_2','ID3','P5','2018-07-03 20:56:34', 'NOT AVAILABLE'],
['T_2','ID3','P5','2018-07-03 22:56:35', 'AVAILABLE'],
['T_2','ID3','P6','2018-07-03 22:57:20', 'NOT AVAILABLE'],
['T_2','ID3','P7','2018-07-03 22:58:35', 'AVAILABLE'],
['T_2','ID4','P8','2018-07-03 22:59:00', 'AVAILABLE'],
['T_2','ID1','P1','2018-07-03 23:12:00', 'AVAILABLE'],
['T_2','ID1','P3','2018-07-03 23:32:00', 'AVAILABLE']
], columns=['TOUR_ID','ID','PAGE_ID','CREATED DATE', 'AVAILABILITY'])