Python Pandas - groupby() with apply() & rolling() very slow

Question

First off, I am fairly new to Python & Pandas, so please be patient and reply in as simple terms as possible. Also, if you could elaborate on any code that is different than what I have in my sample or point me to a solid reference that would make it easy to understand, I would greatly appreciate it.

I have a dataframe (df1) of monthly data with 60+ columns & 800k rows (& growing) for 6000+ locations. I am trying to calculate the rolling mean (3 mo, 12 mo, YTD, etc) based on the location license number ('lic_num', int), month ('mo_yr', date). I have been successful in doing this using apply(). The problem is that apply() feels very slow taking 10 min. This isn't a major issue for this project because this wont be something that will need to be run on demand, but I want to become more efficient at writing code similar to this in the case where I need a project to execute faster. Here is a sample of the dataframe (df1) and my code that I use to achieve my results

lic_num      mo_yr        ap         aw       fi
120700142 2013-03-01  228214.3  206273.53  61393.0
120700142 2013-04-01  256239.4  235296.96  64228.0
120700142 2013-05-01  247725.3  227165.09  74978.0
120700142 2013-06-01  229776.8  211765.55  64559.0
120700142 2013-07-01  229036.2  210963.06  58132.0

df1_col_list = df1.columns.tolist()

for col in df1_col_list[2:5]:
    df1[col+'_3mo'] = df1.groupby('lic_num', as_index=False).apply(
    lambda x: x.rolling(3, on='mo_yr', min_periods=1)[col].mean()).reset_index(level=0, drop=True)

lic_num      mo_yr        ap         aw       fi         ap_3mo         aw_3mo        fi_3mo
120700142 2013-03-01  228214.3  206273.53  61393.0  228214.300000  206273.530000  61393.000000
120700142 2013-04-01  256239.4  235296.96  64228.0  242226.850000  220785.245000  62810.500000
120700142 2013-05-01  247725.3  227165.09  74978.0  244059.666667  222911.860000  66866.333333
120700142 2013-06-01  229776.8  211765.55  64559.0  244580.500000  224742.533333  67921.666667
120700142 2013-07-01  229036.2  210963.06  58132.0  235512.766667  216631.233333  65889.666667

`apply` and explicit iteration should be a last resort in Pandas. — AMC, Nov 23 '19 at 19:01

score 4 · Accepted Answer · answered Nov 23 '19 at 18:24

4

If apply is slow , we try not use it. Here is more info about the reason why apply is slow When should I ever want to use pandas apply() in my code?

s=df.groupby('lic_num', as_index=False).\
       rolling(3, on='mo_yr', min_periods=1).\       
          mean().iloc[:,2:5].\
             add_suffix('_3mo').reset_index(drop=True,level=0)

df=pd.concat([df,s],axis=1)

answered Nov 23 '19 at 18:24

BENY

317,841
20
164
234

Thank you. I appreciate the quick response. Once I get back home, I will try it and update. Again, thanks. – Jack Valadez Nov 23 '19 at 22:23
Tried the proposed solution and received the following error. "NotImplementedError: ops for Rolling for this dtype datetime64[ns] are not implemented" – Jack Valadez Nov 23 '19 at 23:22
@JackValadez there is no error on my side , could you please check your pandas version ? – BENY Nov 23 '19 at 23:39
Ok. The code is working on the sample dataframe I originally applied, but when trying to use it on the larger dataframe, I am getting errors. Going to look deeper and see what I can find out. Thank you. – Jack Valadez Nov 24 '19 at 00:04

Python Pandas - groupby() with apply() & rolling() very slow

1 Answers1