
I have the following dataframe

   location  tps_inter  sess_glob
0      loc1          0          0
1      loc1         79          0
2      loc1          3          0
3      loc1         17          0
4      loc2          0          0
5      loc2         46          0
6      loc3          0          0
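
For reproducibility, the frame above can be rebuilt with this small sketch (values copied from the table):

import pandas as pd

df = pd.DataFrame({
    'location': ['loc1', 'loc1', 'loc1', 'loc1', 'loc2', 'loc2', 'loc3'],
    'tps_inter': [0, 79, 3, 17, 0, 46, 0],
    'sess_glob': [0, 0, 0, 0, 0, 0, 0],
})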

I would like to group by location and add 1 to sess_glob for the first row of each group:

   location  tps_inter  sess_glob
0      loc1          0          1
1      loc1         79          0
2      loc1          3          0
3      loc1         17          0
4      loc2          0          1
5      loc2         46          0
6      loc3          0          1
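
(For that intermediate step on its own, one option is to flag the first row of each group with cumcount; this is just a sketch, and the full rule below makes it unnecessary.)

df['sess_glob'] = df.groupby('location').cumcount().eq(0).astype(int)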

Then, within each group, I want to build a running index from tps_inter: if tps_inter is less than 10, sess_glob should keep the same value as the previous row; if it is greater than 10, it should be the previous value + 1.

The desired result is

   location  tps_inter  sess_glob
0      loc1          0          1
1      loc1         79          2
2      loc1          3          2
3      loc1         17          3
4      loc2          0          1
5      loc2         46          2
6      loc3          0          1

This code works, but it becomes very slow as the number of rows increases:

import pandas as pd

df1 = df.iloc[0:0].copy()              # empty frame with the same columns as df
gdf = df.groupby('location')
i = 1
for table, group in gdf:
    for row, data in group.iterrows():
        if data["tps_inter"] > 10:
            i = i + 1
        data['sess_glob'] = i
        df1 = pd.concat([df1, data.to_frame().T])   # appends one row at a time
    i = 1                                           # reset the counter for the next location

I think there is a better way to do it without the concatenation line, but I can't find it. The main problem I have is getting the result as a DataFrame and not as a Series.

(I used the following question to write my code: How to loop over grouped Pandas dataframe?)

latmos
  • The code inevitably becomes slow because here, `df1 = pd.concat([df1, data.to_frame().T])`, you are creating a new DataFrame every time you go through the loop. Try to refactor that bit of code to only dump the updated values into one new DataFrame at the end (maybe store them in a dictionary, since that plays nicely with pandas?) – Luca Giorgi Feb 18 '19 at 16:03
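
A sketch of that refactor, keeping the original loop logic but storing the computed values in a dict keyed by row index and assigning them once at the end (the dict name sess_values is just illustrative):

sess_values = {}
for table, group in df.groupby('location'):
    i = 1                                   # counter restarts for every location
    for row, data in group.iterrows():
        if data['tps_inter'] > 10:
            i = i + 1
        sess_values[row] = i                # remember the value for this row label
df['sess_glob'] = pd.Series(sess_values)    # one index-aligned assignment instead of repeated concat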

1 Answer


No need to loop: what you need can be accomplished with .cumsum(), after first defining the flag column over which we will sum:

import numpy as np

# flag rows where tps_inter >= 10, then count them cumulatively within each location
df['sess_glob'] = (df.assign(to_csum = np.where(df['tps_inter'].lt(10), 0, 1))
                     .groupby('location').to_csum.cumsum() + 1)
# or, using a boolean flag directly:
# df['sess_glob'] = (df.assign(to_csum = df['tps_inter'].ge(10))
#                      .groupby('location').to_csum.cumsum() + 1).astype(int)

  location  tps_inter  sess_glob
0     loc1          0          1
1     loc1         79          2
2     loc1          3          2
3     loc1         17          3
4     loc2          0          1
5     loc2         46          2
6     loc3          0          1
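
To see what drives the result, the per-location flag and its cumulative sum can be inspected separately (a quick check on the same frame):

flag = df['tps_inter'].ge(10).astype(int)        # 1 on every row that should open a new index
print(flag.groupby(df['location']).cumsum() + 1) # running count per location

Within each location the flag marks the rows where tps_inter is at least 10, so its cumulative sum counts the increments, and adding 1 shifts the count so the groups here start at 1.
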
ALollz