I need to fill missing values in a pandas DataFrame with the mean value of each group. According to this question, transform can achieve this.
However, transform is too slow for my purposes.
For example, take the following setup: a large DataFrame with 100 different groups and 70% NaN values:
import pandas as pd
import numpy as np
size = 10000000 # DataFrame length
ngroups = 100 # Number of Groups
randgroups = np.random.randint(ngroups, size=size) # Creation of groups
randvals = np.random.rand(size) * randgroups * 2 # Random values with mean like group number
nan_indices = np.random.permutation(size) # Shuffled row indices (candidates for NaN)
nanfrac = 0.7 # Fraction of NaN values
nan_indices = nan_indices[:int(nanfrac*size)] # Take fraction of NaN indices
randvals[nan_indices] = np.nan # Set NaN values (the np.NaN alias was removed in NumPy 2.0)
df = pd.DataFrame({'value': randvals, 'group': randgroups}) # Create data frame
Using transform via
df.groupby("group").transform(lambda x: x.fillna(x.mean())) # Takes too long
already takes more than 3 seconds on my computer. I need something an order of magnitude faster (buying a bigger machine is not an option :-D).
So how can I fill the missing values any faster?
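For comparison, here is a minimal sketch of one variant that might be worth timing. It assumes that passing the built-in "mean" string to transform avoids calling a Python lambda once per group, followed by a single vectorized fillna. The small frame and the names small, baseline, group_means and filled are made up for illustration. Whether this reaches the required order of magnitude on the full 10M rows would have to be measured.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical sizes, far smaller than the 10M rows above)
rng = np.random.default_rng(0)
size, ngroups = 1000, 10
groups = rng.integers(ngroups, size=size)             # group labels 0..9
values = rng.random(size) * groups * 2                # values with mean close to the group number
values[rng.permutation(size)[: int(0.7 * size)]] = np.nan  # 70% NaN
small = pd.DataFrame({"value": values, "group": groups})

# Baseline from the question: a Python lambda evaluated once per group
baseline = small.groupby("group")["value"].transform(lambda x: x.fillna(x.mean()))

# Variant: broadcast each group's mean back to the original rows,
# then fill all missing values in one vectorized call
group_means = small.groupby("group")["value"].transform("mean")
filled = small["value"].fillna(group_means)
```

Both results should agree up to floating-point rounding, so the variant can be checked against the lambda version on a small sample before timing it on the full data.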