
I need to fill missing values in a pandas DataFrame with the mean value of each group. According to this question, transform can achieve this.

However, transform is too slow for my purposes.

For example, take the following setting with a large DataFrame with 100 different groups and 70% NaN values:

import pandas as pd
import numpy as np

size = 10000000  # DataFrame length
ngroups = 100  # Number of Groups

randgroups = np.random.randint(ngroups, size=size)  # Creation of groups
randvals = np.random.rand(size) * randgroups * 2    # Random values with mean like group number
nan_indices = np.random.permutation(range(size))    # NaN indices
nanfrac = 0.7                                       # Fraction of NaN values
nan_indices = nan_indices[:int(nanfrac*size)]       # Take fraction of NaN indices
randvals[nan_indices] = np.NaN                      # Set NaN values

df = pd.DataFrame({'value': randvals, 'group': randgroups})  # Create data frame

Using transform via

df.groupby("group").transform(lambda x: x.fillna(x.mean())) # Takes too long

already takes more than 3 seconds on my computer. I need something that is at least an order of magnitude faster (buying a bigger machine is not an option :-D).

So how can I fill the missing values any faster?

SmCaterpillar
  • Is it an option to deal with the missing data before you read it into a frame? – Batman Nov 18 '16 at 17:29
  • Hmm, I'm not sure. I would prefer not to, because the real DataFrame comes from a SQL query (and is actually a few GB in size). – SmCaterpillar Nov 18 '16 at 17:31
  • 1
    I'd look at doing it there then. I wouldn't be surprised if SQL was able to calculate the mean faster than what Pandas is. – Batman Nov 18 '16 at 17:37
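To make that last suggestion concrete, here is a minimal sketch of pushing the imputation into the query itself with a window function (PostgreSQL-style syntax; the table name, column quoting, and connection string are placeholders, not part of the original question):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host/dbname")  # hypothetical connection

# AVG(...) OVER (PARTITION BY ...) ignores NULLs, so COALESCE replaces each
# missing value with the mean of the non-NULL values in its group.
query = """
SELECT "group",
       COALESCE(value, AVG(value) OVER (PARTITION BY "group")) AS value
FROM   my_table
"""
df = pd.read_sql(query, engine)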

3 Answers


You're doing it wrong. It's slow because you're using a lambda:

df[['value']].fillna(df.groupby('group').transform('mean'))
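
For completeness, a minimal sketch of writing the result back into the frame; selecting the value column explicitly keeps the groupby from transforming anything else (the names are the ones from the question's setup):

# Group means via the cythonized 'mean' aggregation rather than a Python lambda
group_means = df.groupby('group')['value'].transform('mean')
df['value'] = df['value'].fillna(group_means)
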
piRSquared
  • Ah I see, maybe you should post this as an answer to the original question as well; there they suggested the use of `lambda`. – SmCaterpillar Nov 18 '16 at 17:11
  • 1
    Still, your solution is "only" 20% faster, but that does not come near an order of magnitude faster :-) – SmCaterpillar Nov 18 '16 at 17:13
  • 1
    @SmCaterpillar With a pandas solution I doubt you can get a significant improvement on this. Most of the time is spent on calculating the mean. `df['value'].fillna(df.groupby('group', sort=False)['value'].transform('mean'))` is a little faster. – ayhan Nov 18 '16 at 17:21
  • on my machine, piRSquared's answer is about 6x faster. `transform` is often way faster if you use a native function like `mean` (cythonized) rather than a lambda (not cythonized) – JohnE Nov 18 '16 at 19:45
  • @SmCaterpillar make sure you are using a recent version of pandas -- it could affect the speed of transform. https://github.com/pandas-dev/pandas/issues/12737 – JohnE Nov 18 '16 at 20:20

Using a Sorted Index + fillna()

You are right - your code takes 3.18s to run. The code provided by @piRSquared takes 2.78s to run.

  1. Example code (the transform from the question):

     %%timeit
     df2 = df.groupby("group").transform(lambda x: x.fillna(x.mean()))

     Output: 1 loop, best of 3: 3.18 s per loop

  2. piRSquared's improvement:

     %%timeit
     df[['value']].fillna(df.groupby('group').transform('mean'))

     Output: 1 loop, best of 3: 2.78 s per loop

  3. A slightly more efficient way (using a sorted index and fillna()):

You can set the group column as the index of the dataframe, and sort it.

df = df.set_index('group').sort_index()

Now that you have a sorted index, it's super cheap to access the subset of the DataFrame belonging to a given group number with df.loc[x, :]

Since you need to impute by the mean for every group, you need all the unique group ids. For this example you could use range (since the groups are from 0 to 99), but more generally you can use:

groups = np.unique(df.index.values)

After this, you can iterate over the groups and use fillna() for imputation:

%%timeit
for x in groups:
    df.loc[x, 'value'] = df.loc[x, 'value'].fillna(np.mean(df.loc[x, 'value']))

Output: 1 loop, best of 3: 231 ms per loop

Note: the set_index, sort_index and np.unique operations are a one-time cost. To be fair, the total time (including these operations) was 2.26 s on my machine, but the imputation piece took only 231 ms.
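
Putting the pieces together, a minimal end-to-end sketch of the sorted-index approach (assigning .values back is a small tweak, not in the original, that side-steps index alignment on the duplicated group labels):

df2 = df.set_index('group').sort_index()   # one-time cost
groups = np.unique(df2.index.values)       # unique group ids

for x in groups:
    subset = df2.loc[x, 'value']                                # all rows of this group
    df2.loc[x, 'value'] = subset.fillna(subset.mean()).values   # positional write-back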

Shivam Gaur

Here's a NumPy approach using np.bincount that's pretty efficient for such bin-based summing/averaging operations -

ids = df.group.values                    # Extract 2 columns as two arrays
vals = df.value.values

m = np.isnan(vals)                             # Mask of NaNs
grp_sums = np.bincount(ids,np.where(m,0,vals)) # Group sums with NaNs as 0s
avg_vals = grp_sums*(1.0/np.bincount(ids,~m))        # Group averages
vals[m] = avg_vals[ids[m]]              # Set avg values into NaN positions

Note that this would update the value column.
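
For reference, a minimal sketch of how the snippet might be wrapped into the bincount_based function used in the timings below (same logic as above, just packaged as a function):

import numpy as np

def bincount_based(df):
    ids = df.group.values                               # integer group ids
    vals = df.value.values                              # float values (view into the frame)
    m = np.isnan(vals)                                  # mask of NaNs
    grp_sums = np.bincount(ids, np.where(m, 0, vals))   # group sums with NaNs as 0s
    avg_vals = grp_sums * (1.0 / np.bincount(ids, ~m))  # group means = sums / non-NaN counts
    vals[m] = avg_vals[ids[m]]                          # write group means into NaN positions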

Runtime test

Data sizes:

size = 1000000  # DataFrame length
ngroups = 10  # Number of Groups

Timings:

In [17]: %timeit df.groupby("group").transform(lambda x: x.fillna(x.mean()))
1 loops, best of 3: 276 ms per loop

In [18]: %timeit bincount_based(df)
100 loops, best of 3: 13.6 ms per loop

In [19]: 276.0/13.6  # Speedup
Out[19]: 20.294117647058822

20x+ speedup there!

Divakar