299

I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.

How can I replace the NaNs with the averages of the columns they are in?

This question is very similar to this one: numpy array: replace nan values with average of columns, but unfortunately the solution given there doesn't work for a pandas DataFrame.

– piokuc

12 Answers

405

You can simply use DataFrame.fillna to fill the NaNs directly:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

In [28]: df.mean()
Out[28]: 
A   -0.151121
B   -0.231291
C   -0.530307
dtype: float64

In [29]: df.fillna(df.mean())
Out[29]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

The docstring of fillna says that value should be a scalar or a dict; however, it works with a Series as well (current pandas documents Series as a valid value). If you want to pass a dict instead, you could use df.mean().to_dict().
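For example, a quick check that the Series and the dict behave the same (a minimal sketch; the toy frame and column names are made up for illustration):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 2.0, 4.0]})

# Filling with the column-means Series and with its dict form gives the same result
filled_with_series = toy.fillna(toy.mean())
filled_with_dict = toy.fillna(toy.mean().to_dict())

assert filled_with_series.equals(filled_with_dict)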

– bmu
  • `df.fillna(df.mean())` will return the new dataframe, so you will have to write `df = df.fillna(df.mean())` to keep it. – yannis May 04 '18 at 19:14
  • Any ideas why I might be getting the wrong amount imputed for the mean using this? – bernando_vialli Jun 08 '18 at 14:29
  • Instead of `df = df.fillna(df.mean())` you could also use `df.fillna(df.mean(), inplace=True)` – Anderson Pimentel Nov 25 '18 at 20:56
  • CAUTION: if you want to use this for Machine Learning / Data Science: from a Data Science perspective it is *wrong* to first replace NA and then split into train and test... You MUST first split into train and test, then replace NA by the mean on train and then apply this stateful preprocessing model to test; see the answer involving sklearn below, and the sketch after these comments. – Fabian Werner Aug 28 '19 at 09:18
  • @FabianWerner Why is it wrong? I think it's a better idea if NaNs are replaced by the global average rather than the train and test means separately, as a train/test mean can be misleading. – amalik2205 May 11 '20 at 17:19
  • @amalik2205 Because otherwise you are leaking information from the test set into the training set! Imagine it like this: we have 100 data rows and we consider column x. The first 99 entries of x are NA. We want to split off row 100 as a test set. Let's assume row 100 has value 20 in column x. Then you will replace all entries in the training set in column x with 20, a value coming 100% from the test set. Hence, the evaluation might fool you! – Fabian Werner May 11 '20 at 19:18
  • Curious, why is it not `df.fillna(df.mean(1))` if we want the column average? Doesn't fillna operate along the index by default? Edit: I see that just `()` works fine, I just don't understand why... – JDelage Sep 28 '20 at 21:35
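A minimal sketch of the split-then-impute workflow raised in the comments above, assuming df is the 10-row frame from this answer and a simple 80/20 row split chosen only for illustration:

train = df.iloc[:8]
test = df.iloc[8:]

train_means = train.mean()       # statistics come from the training rows only
train = train.fillna(train_means)
test = test.fillna(train_means)  # the same training means are reused on the test rows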
98

Try:

sub2['income'].fillna((sub2['income'].mean()), inplace=True)
– Ammar Shigri
  • For someone wondering about inplace=True: if it is True, the original object is modified with this change. If it is False (the default), the function doesn't modify the original object; instead it returns a modified copy, which you have to assign back to the original object to replace it. – hafiz031 Aug 18 '20 at 02:31
  • Hi, can you please check this "https://stackoverflow.com/questions/76706011/mean-imputation-return-none-and-nan-in-dataframe/76706102#76706102" – Encipher Jul 17 '23 at 15:46
48

Although the code below does the job, its performance takes a big hit when you are dealing with a DataFrame of ~100k records or more:

df.fillna(df.mean())

In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() over the whole DataFrame.

I had a DataFrame with 20 variables, and only 4 of them required NaN value treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), where I ran it selectively, i.e. only on the variables which had NaN values.

#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----

df.fillna(df.mean())

#------------------------------------------------
#----(Code 2) Selective Treatment----------------

for i in df.columns[df.isnull().any(axis=0)]:     #---Applying only on variables with NaN values
    df[i].fillna(df[i].mean(), inplace=True)

#---df.isnull().any(axis=0) gives True/False flag (Boolean value series), 
#---which when applied on df.columns[], helps identify variables with NaN values

Below is the performance I observed as I kept increasing the number of records in the DataFrame.

DataFrame with ~100k records

  • Code 1: 22.06 Seconds
  • Code 2: 0.03 Seconds

DataFrame with ~200k records

  • Code 1: 180.06 Seconds
  • Code 2: 0.06 Seconds

DataFrame with ~1.6 Million records

  • Code 1: code kept running endlessly
  • Code 2: 0.40 Seconds

DataFrame with ~13 Million records

  • Code 1: --did not even try, after seeing performance on 1.6 Mn records--
  • Code 2: 3.20 Seconds

Apologies for the long answer! Hope this helps!
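For reference, the same selective treatment can also be written without an explicit Python loop. This is a sketch I have not benchmarked against the loop above:

na_cols = df.columns[df.isnull().any(axis=0)]          # only the variables that contain NaN
df[na_cols] = df[na_cols].fillna(df[na_cols].mean())   # fill just those columns with their means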

– Gaurav Singh
42
In [15]: import numpy as np; import pandas as pd

In [16]: df = pd.DataFrame(np.random.randn(10, 3))

In [17]: df.iloc[3:5,0] = np.nan

In [18]: df.iloc[4:6,1] = np.nan

In [19]: df.iloc[5:8,2] = np.nan

In [20]: df
Out[20]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3       NaN -0.985188 -0.324136
4       NaN       NaN  0.238512
5  0.769657       NaN       NaN
6  0.141951  0.326064       NaN
7 -1.694475 -0.523440       NaN
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

In [22]: df.mean()
Out[22]: 
0   -0.251534
1   -0.040622
2   -0.841219
dtype: float64

Apply, per column, the mean of that column and fill:

In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622  0.238512
5  0.769657 -0.040622 -0.841219
6  0.141951  0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
– Jeff
16

If you want to impute missing values with the mean, going column by column, this imputes using only the mean of that particular column. It might also be a little more readable.

sub2['income'] = sub2['income'].fillna(sub2['income'].mean())
– Pranay Aryal
15
import numpy as np
import pandas as pd

# To read data from csv file
Dataset = pd.read_csv('Data.csv')

X = Dataset.iloc[:, :-1].values

# To calculate mean use imputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
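If you would rather keep the result as a labelled DataFrame instead of the NumPy array X, one possible sketch (it assumes the columns to impute are the numeric ones of Dataset; adjust the selection to your data):

numeric_cols = Dataset.select_dtypes(include='number').columns
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
Dataset[numeric_cols] = imputer.fit_transform(Dataset[numeric_cols])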
– Roshan jha
  • What's the advantage of all this over the far simpler alternatives? – AMC Feb 16 '20 at 04:22
  • @Roshan Jha It is always better to explain the logic. There are many ways to do the same task in R & Python; however, if you are suggesting something different, you may want to point out some advantages of doing so. – Dr Nisha Arora Mar 11 '20 at 11:41
14

Use df.fillna(df.mean()) directly to fill all the null values with the column means.

If you want to fill the null values of a single column with that column's mean (here Item_Weight is the column name):

df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())

If you want to fill the null values with some string instead (here Outlet_Size is the column name):

df.Outlet_Size = df.Outlet_Size.fillna('Missing')
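Both column-specific fills above can also be combined into a single call by passing a dict to fillna, mapping each column name to its fill value (a sketch using the same columns):

df = df.fillna({'Item_Weight': df['Item_Weight'].mean(),
                'Outlet_Size': 'Missing'})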
– Sunny Barnwal
14

Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column

Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']

If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:

Use method .fillna():

mean_value = df['nr_items'].mean()
df['nr_item_ave'] = df['nr_items'].fillna(mean_value)

I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.

You should be careful when using the mean: if you have outliers, it is more advisable to use the median.
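A median-based version of the same fill (a sketch reusing the nr_items column from above) looks like this:

median_value = df['nr_items'].median()
df['nr_item_ave'] = df['nr_items'].fillna(median_value)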

– pink.slash
10

Another option besides those above is:

df = df.groupby(df.columns, axis=1).transform(lambda x: x.fillna(x.mean()))

It's less elegant than the previous answers for the mean, but it can be handy if you want to replace nulls with some other column function.
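For example, swapping in the median only requires changing the lambda. This sketch keeps the same groupby/transform pattern and assumes a pandas version where grouping with axis=1 is still supported:

df = df.groupby(df.columns, axis=1).transform(lambda x: x.fillna(x.median()))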

– guibor
3

Using the SimpleImputer class from the sklearn library:

import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer works column-wise; unlike the old Imputer it has no axis parameter
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:, 1:3])
x[:, 1:3] = missingvalues.transform(x[:, 1:3])

Note: in recent versions of sklearn, the value of the missing_values parameter changed from 'NaN' to np.nan.

– MedoAlmasry
1

I use this method to fill missing values with the average of each column:

fill_mean = lambda col: col.fillna(col.mean())

df = df.apply(fill_mean, axis=0)
– Punit Vara
0

You can also use value_counts to fill each column with its most frequent value instead of the mean. This works across different datatypes.

df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))

Here is the value_counts api reference.
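An equivalent one-liner uses DataFrame.mode (a sketch; like the snippet above, it fills each column with its most frequent value):

df = df.fillna(df.mode().iloc[0])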

– pratsbhatt