133

This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem.

Suppose I have the following dataframe

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})

  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3

and I'd like to fill in each "NaN" with the mean value of its "name" group, i.e.

  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3

I'm not sure where to go after:

grouped = df.groupby('name').mean()

Thanks a bunch.

jpp
BlueFeet

12 Answers

136

One way would be to use transform:

>>> df
  name  value
0    A      1
1    A    NaN
2    B    NaN
3    B      2
4    B      3
5    B      1
6    C      3
7    C    NaN
8    C      3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
  name  value
0    A      1
1    A      1
2    B      2
3    B      2
4    B      3
5    B      1
6    C      3
7    C      3
8    C      3
DSM
  • 6
    I found it helpful when starting out to sit down and read through the docs. This one is covered in the [`groupby`](http://pandas.pydata.org/pandas-docs/stable/groupby.html) section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on. – DSM Nov 13 '13 at 22:57
  • 1
    Also look for the Wes McKinney book. Personally I think the docs on groupby are abysmal; the book is marginally better. – Woody Pride Nov 14 '13 at 00:51
  • 51
    if you have more than two columns, make sure to specify the column name df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value'] – Lauren Jan 10 '17 at 16:57
  • 28
    @Lauren Good point. I'd like to add that, for performance reasons, you might consider moving the value column selection further left, directly after the group-by clause. This way the lambda function is only called for values in that particular column, rather than for every column followed by a column selection. I did a test and it was twice as fast when using two columns. And naturally you get better performance the more columns you don't need to impute: `df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))` – André C. Andersen Jul 28 '17 at 12:11
  • I have been searching for this for two days. Just a question for you: why is it so hard to do this with loops? In my case there are two index levels, `State` and `Age_Group`, and I am trying to fill missing values in those groups with group means (take the mean from the same state within the same age group and fill the missing values in that group). Thanks – Ozkan Serttas Jan 09 '19 at 20:26
  • Oh never mind I see the generalized solution thanks to @AndréC.Andersen – Ozkan Serttas Jan 09 '19 at 21:29
111

fillna + groupby + transform + mean

This seems intuitive:

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.
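
To make the index alignment concrete, here is a minimal sketch using the question's data. The names `group_mean` and `filled` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
})

# transform('mean') returns one value per original row (not one per group),
# so the result shares df's index and can be passed straight to fillna.
group_mean = df.groupby('name')['value'].transform('mean')

# fillna only touches the NaN positions; existing values are left alone.
filled = df['value'].fillna(group_mean)
print(filled.tolist())
```

Printing `group_mean` on its own shows the per-group mean repeated on every row of its group, which is what lets `fillna` pick the right value by row label.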

jpp
  • 2
    Thanks! I find the lambda function a little bit confusing and yours much more understandable. – Anindhito Irmandharu Mar 17 '21 at 04:25
  • 4
    Nice solution. My groupby returns 73k groups. So in other words it needed to find the mean of 73k groups in order to fill in the NA values for each group. My main concern here is timing as I want to easily scale it to more than 73k groups. The lambda solution took 21.39 seconds to finish while this solution took 0.27 seconds. Highly recommend going for this solution! – Sam Mar 31 '21 at 13:48
  • 2
    Does df = df.fillna(df.groupby('name').transform('mean')) do this successfully for all columns? I'm using that and it looks alright, but I'm afraid I'm doing something wrong, as everyone here does it per column. – Olli Sep 05 '21 at 10:47
27

@DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:

df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A','A', 'B','B','B','B', 'C','C','C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)

... gives ...

  category name  other_value  value
0        X    A         10.0   1.0
1        X    A          NaN   NaN
2        X    B          NaN   NaN
3        X    B         20.0   2.0
4        X    B         30.0   3.0
5        X    B         10.0   1.0
6        Y    C         30.0   3.0
7        Y    C          NaN   NaN
8        Y    C         30.0   3.0

In this generalized case we would like to group by category and name, and impute only on value.

This can be solved as follows:

df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))

Notice the column list in the group-by clause, and that we select the value column right after the group-by. This way the transformation is run only on that particular column. You could select the column at the end instead, but then the transformation runs on all columns, only for all but one measure column to be thrown out at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
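
As a quick sanity check that the two orderings only differ in cost and not in result, here is a minimal sketch on the example frame. The names `fill_with_mean`, `select_first` and `select_last` are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

keys = ['category', 'name']


def fill_with_mean(x):
    """Fill the NaNs of one group's column with that group's mean."""
    return x.fillna(x.mean())


# Select first: the function is called for the 'value' column only.
select_first = df.groupby(keys)['value'].transform(fill_with_mean)

# Select last: the function is called for every non-key column,
# and everything except 'value' is then discarded.
select_last = df.groupby(keys).transform(fill_with_mean)['value']

print(select_first.equals(select_last))
```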

A performance test on a larger dataset, built by repeating the frame 10,000 times ...

df = pd.concat([df] * 10000)

... confirms that the speed-up is proportional to the number of columns you don't have to impute:

import pandas as pd
from datetime import datetime

def generate_data():
    ...

t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)

# 0:00:00.016012

t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
    .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)

# 0:00:00.030022

On a final note you can generalize even further if you want to impute more than one column, but not all:

df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))
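
The multi-column case can also be written without a lambda, by combining it with the `fillna` + `transform('mean')` idea from another answer on this page. This is a sketch under that assumption, not the code that was timed above; `keys`, `cols` and `group_means` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

keys = ['category', 'name']
cols = ['value', 'other_value']

# Double brackets: a list of column names selects a sub-frame of the groupby.
group_means = df.groupby(keys)[cols].transform('mean')

# group_means has the same index and columns as df[cols],
# so each NaN is filled from the matching cell.
df[cols] = df[cols].fillna(group_means)
print(df)
```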
André C. Andersen
  • Thank you for this great work. I am wondering how I could achieve the same transformation using `for` loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.Andersen – Ozkan Serttas Jan 09 '19 at 21:55
  • Hi @andre-c-andersen, I am trying to use your method for `planets` dataset, but it's not imputing all the values. Not sure why: `https://stackoverflow.com/questions/73449902/fill-in-missing-values-with-groupby/73450241` – Roy Aug 26 '22 at 18:45
  • @Roy I looked into your question and made an answer. – André C. Andersen Aug 27 '22 at 19:30
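
For the manual `for`-loop variant asked about in the comments, a minimal sketch could look like the following. It assumes the example frame from this answer (without `other_value`), and the names `result`, `group_mean` and `missing` are invented for the illustration. It is expected to be much slower than `transform` on large data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

result = df.copy()

# Iterating over a groupby yields (group key, sub-frame) pairs;
# each sub-frame keeps the row labels of the original frame.
for key, group in df.groupby(['category', 'name']):
    group_mean = group['value'].mean()            # NaN is skipped by default
    missing = group.index[group['value'].isna()]  # row labels to fill in this group
    result.loc[missing, 'value'] = group_mean

print(result)
```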
17

Shortcut:

Groupby + Apply + Lambda + Fillna + Mean

>>> df['value'] = df.groupby('name', group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
>>> df.isnull().sum().sum()
    0 

This solution still works if you want to group by multiple columns to replace missing values.

>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan, np.nan, 4, 3],
...                    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
...                    'class': list('ppqqrrsss')})
>>> df['value'] = df.groupby(['name', 'class'], group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
>>> df
        value name   class
    0    1.0    A     p
    1    1.0    A     p
    2    2.0    B     q
    3    2.0    B     q
    4    3.0    B     r
    5    3.0    B     r
    6    3.5    C     s
    7    4.0    C     s
    8    3.0    C     s
 
Ashish Anand
14

I'd do it this way

df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')
IanS
piRSquared
  • 1
    A slightly different version of this: `df['value_imputed'] = np.where(df.value.isnull(), df.groupby('name').value.transform('mean'), df.value)` – tsando Jul 16 '19 at 10:13
6

The top-ranked answer only works for a pandas DataFrame with exactly two columns. If you have more columns, use this instead:

df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
    lambda x: x.fillna(x.mean()))
Philipp Schwarz
  • This answer worked for me, thanks. Also, for anyone new to pandas: you can index using bracket notation as well, `df.groupby("continent")['Crude_Birth_rate']...`, which I believe is the suggested convention – Adam Hughes Nov 07 '19 at 19:07
4

To summarize the points above about the efficiency of the possible solutions: I have a dataset with 97 906 rows and 48 columns. I want to fill in 4 columns with the median of each group. The column I want to group by has 26 200 groups.

The first solution

import time

start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds

The second solution

start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds

The next solution I only performed on a subset since it was running too long.

start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds

The following solution follows the same logic as above.

start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds

So it's quite important to choose the right method. Bear in mind that I noticed the run times went up sharply once a column was not numeric (which makes sense, as I was computing the median).
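
Since `df_merged` is not available here, the comparison can be reproduced in spirit on synthetic data. The sketch below is only that: the sizes, the column names `group_id` and `x`, and the helper `timed` are all assumptions made for illustration, and the timings it prints will differ by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset described above (sizes are arbitrary).
rng = np.random.default_rng(0)
n_rows, n_groups = 20_000, 2_000
df = pd.DataFrame({
    'group_id': rng.integers(0, n_groups, size=n_rows),
    'x': rng.normal(size=n_rows),
})
df.loc[rng.random(n_rows) < 0.2, 'x'] = np.nan  # knock out about 20% of the values


def timed(func):
    """Return (result, elapsed seconds) for a single call of func."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Built-in aggregation name: the per-group median is computed in compiled code.
fast, t_fast = timed(
    lambda: df['x'].fillna(df.groupby('group_id')['x'].transform('median'))
)

# Python lambda: called once per group, which is where the time goes.
slow, t_slow = timed(
    lambda: df.groupby('group_id')['x'].transform(lambda s: s.fillna(s.median()))
)

print(f"transform('median'): {t_fast:.4f} s, lambda: {t_slow:.4f} s")
```

Both variants should produce the same filled column; only the run time differs.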

Sam
2
def groupMeanValue(group):
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

dft = df.groupby("name", group_keys=False).apply(groupMeanValue)
2

I know this is an old question, but I am quite surprised by the unanimity of apply/lambda answers here.

Generally speaking, that is the second worst thing to do, after iterating over rows, from a timing point of view.

What I would do here is

df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')

Or using fillna

df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))

I've checked with timeit (because, again, the unanimity for apply/lambda-based solutions made me doubt my instinct), and that is indeed 2.5 times faster than the most upvoted solutions.

chrslg
0

To fill all the numeric null values with the mean grouped by "name"

num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df.groupby("name")[num_cols].transform(lambda x: x.fillna(x.mean()))
abu8na9
-1
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
Paul Roub
  • 6
    Please give some explanation of your answer. Why should someone who stumbles upon this page from google use your solution over the other 6 answers? – divibisan Oct 04 '18 at 20:28
  • 1
    @vino please add some explanation – Noordeen Feb 16 '19 at 19:28
  • That would be an interesting solution if it worked. It is the only one that does not rely on apply or lambdas (which lead to quite slow execution, because they imply iterating in the Python world rather than in the C world). But the problem is that it doesn't work. It just produces a table associating index 0 with the mean of the As (that is, 1), index 1 with the mean of the Bs (2), and index 2 with the mean of the Cs (3). Then fillna replaces the NaN values in rows 0, 1, 2 of df with the matching values from this mean table, filling row 1 with the value 2 and row 2 with the value 3, which are both wrong, and leaving row 7 as NaN – chrslg Nov 20 '22 at 16:27
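
The misalignment described in the last comment can be checked directly on the question's data. A minimal sketch, with `means` and `wrong` as illustrative names and the non-inplace form of `fillna` used so the original frame stays untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
})

# One row per group, labelled 0, 1, 2 (not by the original row labels).
means = df.groupby(['name'], as_index=False).mean()
print(means)

# fillna with a DataFrame aligns on row label and column name, so the NaN in
# row 1 picks up means.loc[1, 'value'] (group B), not the mean of its own group A.
wrong = df.fillna(means)
print(wrong)
```

Row 7 has no counterpart in `means`, which is why it is left unfilled.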
-1

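You can also use df.apply(lambda x: x.fillna(x.mean())), where df is your DataFrame or table name. Note that this fills each column with its overall mean rather than with the per-group mean.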

Jack Fleeting