Pandas: filling missing values by mean in each group

Question

This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem....

Suppose I have the following dataframe

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})

  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3

and I'd like to fill in each "NaN" with the mean value of its "name" group, i.e.

  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3

I'm not sure where to go after:

grouped = df.groupby('name').mean()

Thanks a bunch.

score 136 · Answer 1 · answered Nov 13 '13 at 22:51


One way would be to use transform:

>>> df
  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3

answered Nov 13 '13 at 22:51

DSM


I found it helpful when starting out to sit down and read through the docs. This one is covered in the [`groupby`](http://pandas.pydata.org/pandas-docs/stable/groupby.html) section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on. – DSM Nov 13 '13 at 22:57
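The rule of thumb in the comment above can be illustrated with a small sketch. It is not part of the original thread; it reuses the question's frame and only assumes that `numpy` and `pandas` are importable. An aggregation returns one row per group, whereas `transform` returns a result indexed like the original frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
})

# Aggregation: one mean per group, indexed by the group key.
per_group = df.groupby('name')['value'].mean()

# Transform: the group mean broadcast back onto every original row.
per_row = df.groupby('name')['value'].transform('mean')

print(per_group.shape)  # (3,)
print(per_row.shape)    # (9,)
```

Because `per_row` shares the index of `df`, it can be passed straight to `fillna` or assigned back as a column.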

Also look for the Wes McKinney book. Personally I think the docs on groupby are abysmal; the book is marginally better. – Woody Pride Nov 14 '13 at 00:51

If you have more than two columns, make sure to specify the column name: `df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']` – Lauren Jan 10 '17 at 16:57

@Lauren Good point. I'd like to add that, for performance reasons, you might consider moving the value column selection further left, to the group-by clause. This way the lambda function is only called for the values in that particular column, rather than for every column before one is selected. I ran a test, and it was twice as fast when using two columns. And naturally you get better performance the more columns you don't need to impute: `df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))` – André C. Andersen Jul 28 '17 at 12:11
I have been searching for this for two days. Just a question for you: why is it so hard to do this with loops? In my case there are two multi-index levels, i.e. `State` and `Age_Group`, and I am trying to fill missing values in those groups with group means (take the mean from the same state within the same age group and fill the missing values in that group). Thanks – Ozkan Serttas Jan 09 '19 at 20:26
Oh never mind I see the generalized solution thanks to @AndréC.Andersen – Ozkan Serttas Jan 09 '19 at 21:29

score 111 · Answer 2 · answered Nov 16 '18 at 13:59


`fillna` + `groupby` + `transform` + `mean`

This seems intuitive:

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.
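As a minimal, self-contained sketch of the one-liner above (using the question's data and assuming `numpy` and `pandas` are installed), the intermediate series of group means can be kept in its own variable to make the alignment visible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
})

# One mean per row, aligned with df's index (NaN values are skipped by mean).
group_means = df.groupby('name')['value'].transform('mean')

# fillna only touches the missing entries; existing values are left alone.
df['value'] = df['value'].fillna(group_means)

print(df['value'].tolist())
# [1.0, 1.0, 2.0, 2.0, 3.0, 1.0, 3.0, 3.0, 3.0]
```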

answered Nov 16 '18 at 13:59

jpp


Thanks! I find the lambda function a little bit confusing and yours much more understandable. – Anindhito Irmandharu Mar 17 '21 at 04:25

Nice solution. My groupby returns 73k groups. So in other words it needed to find the mean of 73k groups in order to fill in the NA values for each group. My main concern here is timing as I want to easily scale it to more than 73k groups. The lambda solution took 21.39 seconds to finish while this solution took 0.27 seconds. Highly recommend going for this solution! – Sam Mar 31 '21 at 13:48

Does `df = df.fillna(df.groupby('name').transform('mean'))` do this successfully for all columns? I'm using that and it looks alright, but I'm afraid I'm doing something wrong, as everyone here does it per column. – Olli Sep 05 '21 at 10:47
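One hedged way to approach the all-columns question above is to restrict the operation to numeric columns explicitly, since taking a mean over string columns may raise in recent pandas versions. The sketch below is an illustration rather than a confirmed answer from the thread; the frame and the `num_cols` name are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

# Only numeric columns are imputed; the grouping column 'name' is a string.
num_cols = df.select_dtypes(include='number').columns.tolist()

# Group means for every numeric column, aligned with df's index.
group_means = df.groupby('name')[num_cols].transform('mean')
df[num_cols] = df[num_cols].fillna(group_means)

print(df.isna().sum().sum())  # 0
```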

André C. Andersen · Answer 3 · 2017-07-28T12:32:53.460

@DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:

df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A','A', 'B','B','B','B', 'C','C','C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)

... gives ...

  category name  other_value value
0        X    A         10.0   1.0
1        X    A          NaN   NaN
2        X    B          NaN   NaN
3        X    B         20.0   2.0
4        X    B         30.0   3.0
5        X    B         10.0   1.0
6        Y    C         30.0   3.0
7        Y    C          NaN   NaN
8        Y    C         30.0   3.0

In this generalized case we would like to group by category and name, and impute only on value.

This can be solved as follows:

df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))

Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.

Performance test by increasing the dataset by doing ...

big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df

... confirms that this increases the speed proportional to how many columns you don't have to impute:

import pandas as pd
from datetime import datetime

def generate_data():
    ...

t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)

# 0:00:00.016012

t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
    .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)

# 0:00:00.030022

On a final note you can generalize even further if you want to impute more than one column, but not all:

df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))
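For reference, here is a runnable sketch of the multi-key, multi-column case on the small frame from this answer. It assumes a pandas version where several columns are selected from a group-by with a list (double brackets); the expected values were worked out by hand from the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

cols = ['value', 'other_value']

# Select the columns to impute right after the group-by, as a list.
df[cols] = (
    df.groupby(['category', 'name'])[cols]
      .transform(lambda x: x.fillna(x.mean()))
)

print(df['other_value'].tolist())
# [10.0, 10.0, 20.0, 20.0, 30.0, 10.0, 30.0, 30.0, 30.0]
```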

Thank you for this great work. I am wondering how I could achieve the same transformation using `for` loops. Speed is not my concern, since I am trying to find manual methods. Thanks @AndréC.Andersen – Ozkan Serttas, Jan 09 '19 at 21:55
Hi @andre-c-andersen, I am trying to use your method for `planets` dataset, but it's not imputing all the values. Not sure why: `https://stackoverflow.com/questions/73449902/fill-in-missing-values-with-groupby/73450241` — Roy, Aug 26 '22 at 18:45
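Regarding the `for` loop question in the comments, a manual version might look like the sketch below. This is an illustration only, not something posted in the thread: it assumes the frame has a unique index (so that `group.index` identifies exactly the rows of one group) and it will be slower than `transform`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

# Iterate over the groups; each `group` is the sub-frame for one key pair.
for key, group in df.groupby(['category', 'name']):
    group_mean = group['value'].mean()  # NaN entries are skipped
    # Write the filled values back to the same rows of the original frame.
    df.loc[group.index, 'value'] = group['value'].fillna(group_mean)

print(df['value'].tolist())
# [1.0, 1.0, 2.0, 2.0, 3.0, 1.0, 3.0, 3.0, 3.0]
```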

Ashish Anand · Answer 4 · 2021-02-16T17:30:18.193

Shortcut:

Groupby + Apply + Lambda + Fillna + Mean

>>> df['value'] = df.groupby('name', group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
>>> df.isnull().sum().sum()
    0

This solution still works if you want to group by multiple columns to replace missing values.

>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan, np.nan, 4, 3],
...     'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'], 'class': list('ppqqrrsss')})

>>> df['value'] = df.groupby(['name', 'class'], group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
       
>>> df
        value name   class
    0    1.0    A     p
    1    1.0    A     p
    2    2.0    B     q
    3    2.0    B     q
    4    3.0    B     r
    5    3.0    B     r
    6    3.5    C     s
    7    4.0    C     s
    8    3.0    C     s

score 14 · Answer 5 · edited Oct 09 '17 at 09:55


I'd do it this way

df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')

edited Oct 09 '17 at 09:55

IanS


answered Nov 18 '16 at 17:18

piRSquared


A slightly different version to this `df['value_imputed'] = np.where(df.value.isnull(), df.groupby('group').value.transform('mean'), df.value)` – tsando Jul 16 '19 at 10:13
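The `np.where` variant from the comment above can be sketched as follows. The example substitutes the question's `name` column for the `group` column used in the comment, which is an assumption made so that the code runs on the question's data. Writing to a new column keeps the original values available for comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
})

group_means = df.groupby('name')['value'].transform('mean')

# Where 'value' is missing take the group mean, otherwise keep the value.
df['value_imputed'] = np.where(df['value'].isnull(), group_means, df['value'])

print(df['value_imputed'].tolist())
# [1.0, 1.0, 2.0, 2.0, 3.0, 1.0, 3.0, 3.0, 3.0]
```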

Philipp Schwarz · Answer 6 · 2016-10-13T09:08:20.097


The top-ranked answer only works for a pandas DataFrame with only two columns. If you have more columns, use this instead:

df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
    lambda x: x.fillna(x.mean()))

edited Oct 13 '16 at 09:08

answered Oct 13 '16 at 08:52

Philipp Schwarz


This answer worked for me, thanks. Also, for anyone new to pandas: you can also index using slicing notation, `df.groupby("continent")['Crude_Birth_rate']...`, which I believe is the suggested convention. – Adam Hughes Nov 07 '19 at 19:07

score 4 · Answer 7 · answered Apr 01 '21 at 12:39

To summarize all of the above with respect to the efficiency of the possible solutions: I have a dataset with 97 906 rows and 48 columns. I want to fill in 4 columns with the median of each group. The column I want to group by has 26 200 groups.

The first solution

start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds

The second solution

start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds

The next solution I only performed on a subset since it was running too long.

start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds

The following solution follows the same logic as above.

start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds

So it's quite important to choose the right method. Bear in mind that I noticed that once a column was not numeric, the times went up sharply (which makes sense, as I was computing the median).

score 2 · Answer 8 · answered Mar 09 '16 at 14:36


def groupMeanValue(values):
    # values holds the 'value' column of a single group
    return values.fillna(values.mean())

dft = df.copy()
dft['value'] = df.groupby("name")['value'].transform(groupMeanValue)

answered Mar 09 '16 at 14:36

Prajit Patil


chrslg · Answer 9 · 2022-11-20T16:30:56.870

I know this is an old question, but I am quite surprised by the unanimity of the apply/lambda answers here.

Generally speaking, that is the second worst thing to do after iterating rows, from timing point of view.

What I would do here is

df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')

Or using fillna

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

I've checked with timeit (because, again, the unanimity for apply/lambda-based solutions made me doubt my instinct), and that is indeed 2.5 times faster than the most upvoted solutions.
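A comparison of this kind could be set up roughly as below. This is a sketch under stated assumptions, not the author's actual benchmark: the data is synthetic, the sizes are arbitrary, and `with_lambda` and `with_builtin` are hypothetical helper names. Timings depend on the machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_groups = 20_000, 1_000  # arbitrary sizes for illustration

df = pd.DataFrame({
    'name': rng.integers(0, n_groups, size=n_rows),
    'value': rng.normal(size=n_rows),
})
# Blank out roughly 20% of the values.
df.loc[rng.random(n_rows) < 0.2, 'value'] = np.nan

def with_lambda(frame):
    # Python-level function called once per group.
    return frame.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))

def with_builtin(frame):
    # Built-in 'mean' aggregation, then a single vectorised fillna.
    return frame['value'].fillna(frame.groupby('name')['value'].transform('mean'))

t_lambda = timeit.timeit(lambda: with_lambda(df), number=3)
t_builtin = timeit.timeit(lambda: with_builtin(df), number=3)
print(f"lambda: {t_lambda:.3f}s  built-in: {t_builtin:.3f}s")
```

Both helpers return the same filled series (up to floating-point rounding), so the difference is purely one of speed.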

score 0 · Answer 10 · answered Jan 09 '23 at 15:36


To fill all the numeric null values with the mean grouped by "name"

num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df.groupby("name")[num_cols].transform(lambda x: x.fillna(x.mean()))

answered Jan 09 '23 at 15:36

abu8na9


score -1 · Answer 11 · edited Oct 04 '18 at 18:19


df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)

edited Oct 04 '18 at 18:19

Paul Roub


answered Oct 04 '18 at 18:11

Vino Vincent


Please give some explanation of your answer. Why should someone who stumbles upon this page from google use your solution over the other 6 answers? – divibisan Oct 04 '18 at 20:28

@vino please add some explanation – Noordeen Feb 16 '19 at 19:28
That would be an interesting solution if it worked. It is the only one that does not rely on apply or lambdas (which lead to quite slow execution, because they imply iterating in the Python world rather than in the C world). But the problem is that it doesn't work. It just produces a table associating index 0 with the mean of the As (1), index 1 with the mean of the Bs (2), and index 2 with the mean of the Cs (3). Then fillna replaces the NaN values among rows 0, 1, 2 of df with the matching values in this table of means, so it fills row 1 with the value 2 and row 2 with the value 3, which are both wrong, and it leaves row 7 as NaN. – chrslg Nov 20 '22 at 16:27
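The misalignment described in the comment above can be reproduced with the question's data. The sketch below uses the non-inplace form of `fillna` so the original frame stays intact; the variable names `means` and `attempt` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
})

# One row per group, with a fresh 0..2 index: A -> 1.0, B -> 2.0, C -> 3.0.
means = df.groupby(['name'], as_index=False).mean()

# fillna aligns on the row index, not on the group, so row labels 0..2 of
# `means` are matched against row labels 0..2 of df.
attempt = df.fillna(means)

print(attempt.loc[1, 'value'])  # 2.0 (B's mean, but row 1 belongs to A)
print(attempt.loc[2, 'value'])  # 3.0 (C's mean, but row 2 belongs to B)
print(attempt.loc[7, 'value'])  # nan (no row 7 in `means`)
```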

score -1 · Answer 12 · edited Sep 28 '19 at 19:51


You can also use `df.apply(lambda x: x.fillna(x.mean()))`, where `df` is your DataFrame or table name.

edited Sep 28 '19 at 19:51

Jack Fleeting


answered Sep 28 '19 at 19:10

Hardik Pachgade
