
I have a canonical Pandas transform example whose performance seems inexplicably slow. I have read the Q&A on the apply method, which is related but, in my humble opinion, offers an incomplete and potentially misleading answer to my question, as I explain below.

The first five rows of my dataframe are:

     id        date      xvar
0  1004  1992-05-31  4.151628
1  1004  1993-05-31  2.868015
2  1004  1994-05-31  3.043287
3  1004  1995-05-31  3.189541
4  1004  1996-05-31  4.008760
  • There are 24,693 rows in the dataframe.
  • There are 2,992 unique id values.

I want to center xvar by id.
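
For anyone who wants to reproduce the timings below, here is a rough synthetic stand-in for my data (the column names match my frame, but the values and the exact id/date distribution are made up):

import numpy as np
import pandas as pd

# Synthetic stand-in: ~24,693 rows over ~2,992 ids, mimicking the shape of
# the real data. Values are random; only the structure matters here.
rng = np.random.default_rng(0)
n = 24_693
years = pd.Series(rng.integers(1992, 2020, size=n))
df_r = pd.DataFrame({
    'id': rng.integers(1000, 3992, size=n),  # roughly 2,992 distinct ids
    'date': pd.to_datetime(years.astype(str) + '-05-31'),
    'xvar': rng.normal(3.5, 0.6, size=n),
})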

Approach 1 takes 861 ms:

# Center xvar within each id group using a Python-level lambda
df_r['xvar_center'] = (
    df_r
    .groupby('id')['xvar']
    .transform(lambda x: x - x.mean())
)

Approach 2 takes 9 ms:

# Group means
df_r_mean = (
    df_r
    .groupby('id', as_index=False)['xvar']
    .mean()
    .rename(columns={'xvar':'xvar_avg'})
)

# Merge group means onto dataframe and center
df_w = (
    pd
    .merge(df_r, df_r_mean, on='id', how='left')
    .assign(xvar_center=lambda x: x.xvar - x.xvar_avg)
)
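
As a quick sanity check (a sketch, assuming both snippets above have run; the left merge preserves the row order of df_r, so the two columns align):

import numpy as np

# Both approaches should produce identical centered values,
# up to floating-point tolerance.
assert np.allclose(df_r['xvar_center'], df_w['xvar_center'])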

The Q&A on the apply method recommends relying on vectorized functions whenever possible, much as @sammywemmy's comment implies. I see this as the overlap between the two questions. However, the Q&A on the apply method also states:

"...here are some common situations where you will want to get rid of any calls to apply...Numeric Data"

@sammywemmy's comment does not "get rid of any calls to" the transform method in their answer to my question. On the contrary, the answer relies on the transform method. Therefore, unless @sammywemmy's suggestion is strictly dominated by an alternative approach that does not rely on the transform method, I think my question and its answer are sufficiently distinct from the discussion in Q&A on the apply method. (Thank you for your patience and help.)

  • Avoid the anonymous function and use ``df.xvar - df.groupby("id")["xvar"].transform("mean")`` – sammywemmy Aug 14 '20 at 03:43
  • @sammywemmy Thank you very much. Your approach executes in 3 ms and is more Pythonic than any of mine. This was my first question, so please let me know if I can do anything else to acknowledge your helpful (and rapid) response. – MRR Aug 14 '20 at 03:52
  • @MRR, it's fine. Glad it worked for you. Just pay it forward by helping someone else. Cheers – sammywemmy Aug 14 '20 at 03:53
  • I see the connection between my question and the [linked question](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code), but I'm surprised it is deemed a duplicate. Is @sammywemmy's suggestion readily apparent from the answer to the other question? (I didn't think so.) – MRR Aug 14 '20 at 15:23

1 Answer


This answer is due to the insightful comment from @sammywemmy, who deserves all the credit and none of the blame for any inaccuracy here. Because a similar use of transform is illustrated in the Pandas User's Guide, I thought elaborating might be useful for others.

My hypothesis is that the problem rests with the combination of a non-vectorized function and a large number of groups. When I change the groupby variable from id (2,992 unique values) to year (constructed from the date variable and containing 28 unique values), the performance difference between my original approach and @sammywemmy's narrows substantially but remains significant.
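
The year column is not in the original data; a minimal sketch of how it can be constructed from date:

# Derive the year grouping column from date. The pd.to_datetime call is a
# no-op if date is already a datetime column; adjust the parsing otherwise.
df_r['year'] = pd.to_datetime(df_r['date']).dt.year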

%%timeit
df_r['xvar_center_y'] = (
    df_r
    .groupby('year')['xvar']
    .transform(lambda x: x - x.mean())
)
11.4 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

vs.

%timeit df_r['xvar_center_y'] = df_r.xvar - df_r.groupby('year')['xvar'].transform('mean')
1.69 ms ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The beauty of @sammywemmy's insight is that it applies easily to other common transformations, yielding potentially significant performance improvements at a modest cost in additional code. For example, consider standardizing a variable:

%%timeit
df_r['xvar_z'] = (
    df_r
    .groupby('id')['xvar']
    .transform(lambda x: (x - x.mean()) / x.std())
)
1.34 s ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vs.

%%timeit 
df_r['xvar_z'] = (
    (df_r.xvar - df_r.groupby('id')['xvar'].transform('mean')) 
    / df_r.groupby('id')['xvar'].transform('std')
)
3.96 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • Btw, saving `df_r.groupby('id')['xvar']` to a variable beforehand will shave off some further time, since the grouping isn't done twice. – Dahn Nov 11 '21 at 11:33
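
To illustrate Dahn's suggestion, a sketch (assuming df_r as above) that computes the grouping once and reuses it for both transforms:

# The GroupBy object is created once, so the grouping is not rebuilt
# for the second transform call.
g = df_r.groupby('id')['xvar']
df_r['xvar_z'] = (df_r.xvar - g.transform('mean')) / g.transform('std')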