
This is a tricky question where I'm looking to improve code performance. Imagine a dataframe like this:

TOUR_ID  ID    PAGE_ID     CREATED DATE         AVAILABILITY    

T_1      ID1      P1      2018-07-03 19:10:19     AVAILABLE     
T_1      ID1      P1      2018-07-03 19:10:20     AVAILABLE     
T_1      ID1      P2      2018-07-03 19:12:33     AVAILABLE     
T_1      ID2      P3      2018-07-03 19:13:34     AVAILABLE 
T_1      ID2      P3      2018-07-03 19:13:35     NOT AVAILABLE     
T_1      ID2      P4      2018-07-03 19:16:24     AVAILABLE     

T_2      ID3      P4      2018-07-03 18:23:19     AVAILABLE       
T_2      ID3      P4      2018-07-03 18:23:20     NOT AVAILABLE   
T_2      ID1      P1      2018-07-03 19:10:21     NOT AVAILABLE     
T_2      ID2      P3      2018-07-03 19:13:37     NOT AVAILABLE 
T_2      ID2      P3      2018-07-03 19:13:38     NOT AVAILABLE     
T_2      ID3      P5      2018-07-03 20:56:33     AVAILABLE       
T_2      ID3      P5      2018-07-03 20:56:34     NOT AVAILABLE   
T_2      ID3      P5      2018-07-03 22:56:35     AVAILABLE       
T_2      ID3      P6      2018-07-03 22:57:20     NOT AVAILABLE   
T_2      ID3      P7      2018-07-03 22:58:35     AVAILABLE       
T_2      ID4      P8      2018-07-03 22:59:00     AVAILABLE     
T_2      ID1      P1      2018-07-03 23:12:00     AVAILABLE     
T_2      ID1      P3      2018-07-03 23:32:00     AVAILABLE         

For each group (TOUR_ID, ID, PAGE_ID) I need to create a column holding the last value of the previous group. Additionally, whenever TOUR_ID or ID changes, the first group should get NaNs, because there are no previous values for that combination.

The outcome should look like this:

TOUR_ID   ID    PAGE_ID     CREATED DATE         AVAILABILITY   PREVIOUS AVAILABILITY    

T_1      ID1      P1      2018-07-03 19:10:19     AVAILABLE            NaN     
T_1      ID1      P1      2018-07-03 19:10:20     AVAILABLE            NaN
T_1      ID1      P2      2018-07-03 19:12:33     AVAILABLE         AVAILABLE
T_1      ID2      P3      2018-07-03 19:13:34     AVAILABLE            NaN
T_1      ID2      P3      2018-07-03 19:13:35     NOT AVAILABLE        NaN
T_1      ID2      P4      2018-07-03 19:16:24     AVAILABLE       NOT AVAILABLE

T_2      ID3      P4      2018-07-03 18:23:19     AVAILABLE            NaN
T_2      ID3      P4      2018-07-03 18:23:20     NOT AVAILABLE        NaN
T_2      ID1      P1      2018-07-03 19:10:21     NOT AVAILABLE        NaN
T_2      ID2      P3      2018-07-03 19:13:37     NOT AVAILABLE        NaN
T_2      ID2      P3      2018-07-03 19:13:38     NOT AVAILABLE        NaN
T_2      ID3      P5      2018-07-03 20:56:33     AVAILABLE       NOT AVAILABLE
T_2      ID3      P5      2018-07-03 20:56:34     NOT AVAILABLE   NOT AVAILABLE
T_2      ID3      P5      2018-07-03 22:56:35     AVAILABLE       NOT AVAILABLE
T_2      ID3      P6      2018-07-03 22:57:20     NOT AVAILABLE     AVAILABLE
T_2      ID3      P7      2018-07-03 22:58:35     AVAILABLE       NOT AVAILABLE
T_2      ID4      P8      2018-07-03 22:59:00     AVAILABLE            NaN
T_2      ID1      P1      2018-07-03 23:12:00     AVAILABLE            NaN
T_2      ID1      P3      2018-07-03 23:32:00     AVAILABLE         AVAILABLE

I have code that runs, but it does not scale well (the dataframe has about 900,000 rows). Any help in improving its performance would be duly appreciated.

Here's what I have so far:

for current_op in df.TOUR_ID.unique():
    dummy = df[df.TOUR_ID == current_op].ID.unique()

    for current_ID in dummy:
        dummy_m = df[(df.TOUR_ID == current_op) & (df.ID == current_ID)].PAGE_ID.unique()

        for current_page in dummy_m:
            mask = (df.TOUR_ID == current_op) & (df.ID == current_ID) & (df.PAGE_ID == current_page)
            indexes = mask.reset_index().rename(columns={0: 'Bool'})
            # positional index of the first row of the current group
            ind = indexes.index[indexes['Bool']].tolist()[0]

            if (ind == 0) or (current_page == dummy_m[0]):
                # first page of this (TOUR_ID, ID): no previous value exists
                df.loc[mask, 'Previous_availability'] = np.nan
            else:
                # availability of the row immediately preceding the group
                previous_aval = df.AVAILABILITY.loc[indexes['index'].loc[ind - 1]]
                df.loc[mask, 'Previous_availability'] = previous_aval

Note: the NaNs will eventually be dropped.

-- Edit

Below is the code to create the dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame([['T_1','ID1','P1','2018-07-03 19:10:19', 'AVAILABLE'],
                   ['T_1','ID1','P1','2018-07-03 19:10:20', 'AVAILABLE'],
                   ['T_1','ID1','P2','2018-07-03 19:12:33', 'AVAILABLE'],

                   ['T_1','ID2','P3','2018-07-03 19:13:34', 'AVAILABLE'],
                   ['T_1','ID2','P3','2018-07-03 19:13:35', 'NOT AVAILABLE'],
                   ['T_1','ID2','P4','2018-07-03 19:16:24', 'AVAILABLE'],

                   ['T_2','ID3','P4','2018-07-03 18:23:19', 'AVAILABLE'],
                   ['T_2','ID3','P4','2018-07-03 18:23:20', 'NOT AVAILABLE'],
                   ['T_2','ID1','P1','2018-07-03 19:10:21', 'NOT AVAILABLE'],
                   ['T_2','ID2','P3','2018-07-03 19:13:37', 'NOT AVAILABLE'],
                   ['T_2','ID2','P3','2018-07-03 19:13:38', 'NOT AVAILABLE'],
                   ['T_2','ID3','P5','2018-07-03 20:56:33', 'AVAILABLE'],
                   ['T_2','ID3','P5','2018-07-03 20:56:34', 'NOT AVAILABLE'],
                   ['T_2','ID3','P5','2018-07-03 22:56:35', 'AVAILABLE'],
                   ['T_2','ID3','P6','2018-07-03 22:57:20', 'NOT AVAILABLE'],
                   ['T_2','ID3','P7','2018-07-03 22:58:35', 'AVAILABLE'],
                   ['T_2','ID4','P8','2018-07-03 22:59:00', 'AVAILABLE'],
                   ['T_2','ID1','P1','2018-07-03 23:12:00', 'AVAILABLE'],
                   ['T_2','ID1','P3','2018-07-03 23:32:00', 'AVAILABLE']
                  ], columns=['TOUR_ID','ID','PAGE_ID','CREATED DATE', 'AVAILABILITY'])
    How is the last row NAN? – Zero Aug 25 '18 at 06:16
  • ID4 has no prior values, so it should be NaN – Milo Ventimiglia Aug 25 '18 at 06:17
  • There's no T3. But in summary: T2, ID3, P5 is filled with the last value of T2, ID3, P4. T2, ID3, P4 has NaNs because there are no prior values for T2, since it has changed from T1. – Milo Ventimiglia Aug 25 '18 at 06:20
  • @MiloVentimiglia Can you add that example dataframe as a reproducible example? See https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Andrey Portnoy Aug 25 '18 at 06:23
  • Again: ID and TOUR_ID are categories. If they change, there are no prior values and the result should be set to NaN. Within a given TOUR_ID and ID, if there's a change in PAGE_ID then yes, you can use the prior value. However, for a given TOUR_ID and ID, the first PAGE_ID (which does not have any prior value) should be set to NaN. – Milo Ventimiglia Aug 25 '18 at 06:24
  • Hi all, and thank you for your attention on this matter. To be more clear: the first page_ID of each TOUR_ID, ID group should be NaN because there are no prior values. – Milo Ventimiglia Aug 25 '18 at 07:07

2 Answers


That was quite a headscratcher, but here's one way to solve this:

df = pd.read_csv('test.tsv', sep='\t').set_index(['TOUR_ID', 'ID', 'PAGE_ID'])

Get last row of each group, shift forward by one:

shifted = df.groupby(['TOUR_ID', 'ID', 'PAGE_ID']).last().shift(1).reset_index()

Now we are interested in rows where we see changes in PAGE_ID but not in ID, so we construct a boolean mask:

change = shifted != shifted.shift(1)
mask = np.array(change.PAGE_ID & ~change.ID & ~change.TOUR_ID)

Finally, we apply the mask and join to create new column:

shifted.set_index(['TOUR_ID', 'ID', 'PAGE_ID'], inplace=True)

shifted[~mask] = np.nan

result = df.join(shifted['AVAILABILITY'], rsuffix='LAST')
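Putting the steps above together on a small in-memory version of the sample data (the `CREATED DATE` column is omitted for brevity, since it plays no role in the logic):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['T_1', 'ID1', 'P1', 'AVAILABLE'],
    ['T_1', 'ID1', 'P1', 'AVAILABLE'],
    ['T_1', 'ID1', 'P2', 'AVAILABLE'],
    ['T_1', 'ID2', 'P3', 'AVAILABLE'],
    ['T_1', 'ID2', 'P3', 'NOT AVAILABLE'],
    ['T_1', 'ID2', 'P4', 'AVAILABLE'],
    ['T_2', 'ID3', 'P4', 'AVAILABLE'],
    ['T_2', 'ID3', 'P4', 'NOT AVAILABLE'],
    ['T_2', 'ID3', 'P5', 'AVAILABLE'],
], columns=['TOUR_ID', 'ID', 'PAGE_ID', 'AVAILABILITY']
).set_index(['TOUR_ID', 'ID', 'PAGE_ID'])

# Last row of each group, shifted forward by one group
shifted = df.groupby(['TOUR_ID', 'ID', 'PAGE_ID']).last().shift(1).reset_index()

# Keep only rows where PAGE_ID changed but TOUR_ID and ID did not
change = shifted != shifted.shift(1)
mask = np.array(change.PAGE_ID & ~change.ID & ~change.TOUR_ID)

shifted.set_index(['TOUR_ID', 'ID', 'PAGE_ID'], inplace=True)
shifted[~mask] = np.nan

# Broadcast each group's previous availability onto its member rows
result = df.join(shifted['AVAILABILITY'], rsuffix='LAST')
```

Note that `groupby(...).last()` sorts the groups lexicographically, so the shift runs over that sorted order; this matches the sample data, where page IDs sort in the desired sequence within each (TOUR_ID, ID).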
Andrey Portnoy
  • Hi Andrey. Many thanks for your help, it is quite a brilliant solution. Just to add one detail: I am also interested in changes in the ID. The dataframe I used as an example is a simplified version of what I have at hand. Specifically, the ID and PAGE_ID might repeat themselves for a different TOUR_ID (e.g. ID2 - P3 might appear in T_2 instances). Since there's no prior data for these instances, they should be set to NaN. Is there a way to integrate this into your solution? – Milo Ventimiglia Aug 25 '18 at 08:12
  • @MiloVentimiglia In other words, we are interested in the cases where we don't observe a change in `TOUR_ID`, in addition to `ID`? Adding `~change.TOUR_ID` to the mask construction should account for that. I edited my reply accordingly. – Andrey Portnoy Aug 25 '18 at 08:46
  • @Andrey Portnoy Great! It works, and is a very elegant solution! This solution actually deserves a "super like". Thanks once more! – Milo Ventimiglia Aug 25 '18 at 09:11

Ok here's my stab.

1) Create helper Series P_INT (integer part of PAGE_ID)

2) Create helper DataFrame df_last_availability with MultiIndex ['TOUR_ID', 'ID', 'P_INT']

3) Offset P_INT by 1

4) Set the index of the original df so that it matches df_last_availability. From there you can easily merge (using a left join) the two DataFrames on index.

5) The last chained methods are just cleanup to return the dataframe to its original shape, i.e. drop the helper field and restore the original index.

df['P_INT'] = df.PAGE_ID.str.extract(r'(\d+)', expand=False).astype(int)
df_last_availability = df.groupby(['TOUR_ID', 'ID', 'P_INT']).last()
df['P_INT'] = df.P_INT - 1

(df.set_index(['TOUR_ID', 'ID', 'P_INT'])
   .merge(df_last_availability[['AVAILABILITY']], how='left',
          left_index=True, right_index=True, suffixes=('', '_PREV'))
   .reset_index()
   .drop(['P_INT'], axis=1))
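A minimal end-to-end run of those steps on the T_1 portion of the question's data (a sketch; note this keying scheme assumes page numbers are consecutive within each (TOUR_ID, ID) group, since a gap such as P1 followed by P3 would look up a page that does not exist and yield NaN):

```python
import pandas as pd

df = pd.DataFrame([
    ['T_1', 'ID1', 'P1', 'AVAILABLE'],
    ['T_1', 'ID1', 'P1', 'AVAILABLE'],
    ['T_1', 'ID1', 'P2', 'AVAILABLE'],
    ['T_1', 'ID2', 'P3', 'AVAILABLE'],
    ['T_1', 'ID2', 'P3', 'NOT AVAILABLE'],
    ['T_1', 'ID2', 'P4', 'AVAILABLE'],
], columns=['TOUR_ID', 'ID', 'PAGE_ID', 'AVAILABILITY'])

# 1) helper key: integer part of PAGE_ID
df['P_INT'] = df.PAGE_ID.str.extract(r'(\d+)', expand=False).astype(int)

# 2) last availability per (TOUR_ID, ID, P_INT)
df_last_availability = df.groupby(['TOUR_ID', 'ID', 'P_INT']).last()

# 3) offset so each page looks up the previous page's last value
df['P_INT'] = df.P_INT - 1

# 4) left join on the shared MultiIndex, 5) cleanup
result = (df.set_index(['TOUR_ID', 'ID', 'P_INT'])
            .merge(df_last_availability[['AVAILABILITY']], how='left',
                   left_index=True, right_index=True, suffixes=('', '_PREV'))
            .reset_index()
            .drop(['P_INT'], axis=1))
```

Each P4 row picks up the last AVAILABILITY of P3 within the same (TOUR_ID, ID), while the first page of each group has no match and stays NaN.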
Chris Adams