If the data contains many groups (thousands or more), the accepted answer using a lambda may take a very long time to compute. A fast solution would be:
groups = df.groupby("indx")
mean, std = groups.transform("mean"), groups.transform("std")
normalized = (df[mean.columns] - mean) / std
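Note that transform's output excludes the grouping column, which is why df[mean.columns] selects exactly the value columns to normalize. A quick check on a toy frame (the column names here are just illustrative):

import pandas as pd

df = pd.DataFrame({"a0": [1.0, 2.0, 3.0, 4.0], "indx": [0, 0, 1, 1]})
mean = df.groupby("indx").transform("mean")
print(mean.columns.tolist())  # ['a0'] -- the grouping column 'indx' is excluded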
Explanation and benchmarking
The accepted answer suffers from a performance problem: it calls transform with a pure-Python lambda. Even though groupby.transform itself is fast, as are the already vectorized calls inside the lambda (.mean(), .std() and the subtraction), invoking the Python function separately for each group creates considerable overhead.
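To see that the Python function really is invoked over and over, one can count the calls (a small illustrative sketch; the exact count depends on the pandas version and its internals, but it grows with the number of groups):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a0": np.random.normal(size=100),
                   "indx": np.random.randint(0, 10, size=100)})
n_calls = 0
def counting_zscore(x):
    # same computation as the lambda, instrumented to count invocations
    global n_calls
    n_calls += 1
    return (x - x.mean()) / x.std()

df.groupby("indx").transform(counting_zscore)
print(n_calls)  # scales with the number of groups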
This overhead can be avoided by using purely vectorized Pandas/NumPy calls and not passing any Python function at all, as shown in ErnestScribbler's answer.
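For contrast, a vectorized but merge-based variant might look roughly like this (a hypothetical sketch for a frame like the one above, not ErnestScribbler's exact code); it illustrates the column bookkeeping that the transform approach avoids:

means = df.groupby("indx").mean().reset_index()
stds = df.groupby("indx").std().reset_index()
merged = df.merge(means, on="indx", suffixes=("", "_mean"))
merged = merged.merge(stds, on="indx", suffixes=("", "_std"))
value_cols = [c for c in df.columns if c != "indx"]
normalized = pd.DataFrame({c: (merged[c] - merged[c + "_mean"]) / merged[c + "_std"]
                           for c in value_cols})
# note: the merges also discard the original row index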
We can get around the headache of merging and naming the columns by leveraging the broadcasting abilities of .transform. Let's put the solution from above into a function for benchmarking:
def normalize_by_group(df, by):
    groups = df.groupby(by)
    # computes group-wise mean/std,
    # then auto broadcasts to size of group chunk
    mean = groups.transform("mean")
    std = groups.transform("std")
    normalized = (df[mean.columns] - mean) / std
    return normalized
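One caveat worth knowing: pandas computes the standard deviation with ddof=1, so any group containing a single row gets a std of NaN and normalizes to NaN:

tiny = pd.DataFrame({"a0": [1.0, 2.0, 3.0], "indx": [0, 0, 1]})
print(normalize_by_group(tiny, "indx"))
#          a0
# 0 -0.707107
# 1  0.707107
# 2       NaN   <- group 1 has a single row, so its std is NaN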
I changed the data generation from the original question to allow for more groups:
import numpy as np
import pandas as pd

def gen_data(N, num_groups):
    m = 3
    data = np.random.normal(size=(N, m)) + np.random.normal(size=(N, m)) ** 3
    indx = np.random.randint(0, num_groups, size=N).astype(np.int32)
    df = pd.DataFrame(np.hstack((data, indx[:, None])),
                      columns=['a%s' % k for k in range(m)] + ['indx'])
    return df
With only two groups (thus only two Python function calls), the lambda version is only about 1.8x slower than the numpy code:
In: df2g = gen_data(10000, 2) # 3 cols, 10000 rows, 2 groups
In: %timeit normalize_by_group(df2g, "indx")
6.61 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit df2g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
12.3 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
When the number of groups is increased to 1000, the runtime issue becomes apparent: the lambda version is about 370x slower than the numpy code:
In: df1000g = gen_data(10000, 1000) # 3 cols, 10000 rows, 1000 groups
In: %timeit normalize_by_group(df1000g, "indx")
7.5 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit df1000g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
2.78 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
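As a sanity check that the two versions compute the same thing (up to floating-point noise), one can compare their outputs directly:

result_fast = normalize_by_group(df1000g, "indx")
result_lambda = df1000g.groupby("indx").transform(lambda x: (x - x.mean()) / x.std())
pd.testing.assert_frame_equal(result_fast, result_lambda)  # should raise no error if they agree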