299

I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.

How can I replace the NaNs with the averages of the columns they are in?

This question is very similar to this one: numpy array: replace nan values with average of columns, but unfortunately the solution given there doesn't work for a pandas DataFrame.

– piokuc

12 Answers

405

You can simply use DataFrame.fillna to fill the NaNs directly:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

In [28]: df.mean()
Out[28]: 
A   -0.151121
B   -0.231291
C   -0.530307
dtype: float64

In [29]: df.fillna(df.mean())
Out[29]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

The docstring of fillna says that value should be a scalar or a dict; however, it works with a Series as well (current pandas documents Series as a valid value). If you want to pass a dict instead, you could use df.mean().to_dict().
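For example, a quick check that the Series and the dict behave the same (a minimal sketch; the toy frame and column names are made up for illustration):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 2.0, 4.0]})

# Filling with the column-means Series and with its dict form gives the same result
filled_with_series = toy.fillna(toy.mean())
filled_with_dict = toy.fillna(toy.mean().to_dict())

assert filled_with_series.equals(filled_with_dict)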

– bmu
  • `df.fillna(df.mean())` will return the new dataframe, so you will have to write `df = df.fillna(df.mean())` to keep it. – yannis May 04 '18 at 19:14
  • Any ideas why I might be getting the wrong amount imputed for the mean using this? – bernando_vialli Jun 08 '18 at 14:29
  • Instead of `df = df.fillna(df.mean())` you could also use `df.fillna(df.mean(), inplace=True)` – Anderson Pimentel Nov 25 '18 at 20:56
  • CAUTION: if you want to use this for Machine Learning / Data Science: from a Data Science perspective it is *wrong* to first replace NA and then split into train and test... You MUST first split into train and test, then replace NA by the mean on train and then apply this stateful preprocessing model to test; see the answer involving sklearn below, and the sketch after these comments. – Fabian Werner Aug 28 '19 at 09:18
  • @FabianWerner Why is it wrong? I think it's a better idea if NaNs are replaced by the global average rather than the train and test means separately, as a train/test mean can be misleading. – amalik2205 May 11 '20 at 17:19
  • @amalik2205 Because otherwise you are leaking information from the test set into the training set! Imagine it like this: we have 100 data rows and we consider column x. The first 99 entries of x are NA. We want to split off row 100 as a test set. Let's assume row 100 has value 20 in column x. Then you will replace all entries in the training set in column x with 20, a value coming 100% from the test set. Hence, the evaluation might fool you! – Fabian Werner May 11 '20 at 19:18
  • Curious, why is it not `df.fillna(df.mean(1))` if we want the column average? Doesn't fillna operate along the index by default? Edit: I see that just `()` works fine, I just don't understand why... – JDelage Sep 28 '20 at 21:35
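A minimal sketch of the split-then-impute workflow raised in the comments above, assuming df is the 10-row frame from this answer and a simple 80/20 row split chosen only for illustration:

train = df.iloc[:8]
test = df.iloc[8:]

train_means = train.mean()       # statistics come from the training rows only
train = train.fillna(train_means)
test = test.fillna(train_means)  # the same training means are reused on the test rows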
98

Try:

sub2['income'].fillna((sub2['income'].mean()), inplace=True)
– Ammar Shigri
  • For someone wondering about inplace=True: if it is True, the original object is modified with this change. If it is False (the default), the function doesn't modify the original object; instead it returns a modified copy, which you have to assign back to the original object to replace it. – hafiz031 Aug 18 '20 at 02:31
  • Hi, can you please check this "https://stackoverflow.com/questions/76706011/mean-imputation-return-none-and-nan-in-dataframe/76706102#76706102" – Encipher Jul 17 '23 at 15:46
48

Although the code below does the job, its performance takes a big hit when you are dealing with a DataFrame of ~100k records or more:

df.fillna(df.mean())

In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() over the whole DataFrame.

I had a DataFrame with 20 variables, and only 4 of them required NaN value treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), where I ran it selectively, i.e. only on the variables which had NaN values.

#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----

df.fillna(df.mean())

#------------------------------------------------
#----(Code 2) Selective Treatment----------------

for i in df.columns[df.isnull().any(axis=0)]:     #---Applying only on variables with NaN values
    df[i].fillna(df[i].mean(), inplace=True)

#---df.isnull().any(axis=0) gives True/False flag (Boolean value series), 
#---which when applied on df.columns[], helps identify variables with NaN values

Below is the performance I observed as I kept increasing the number of records in the DataFrame.

DataFrame with ~100k records

  • Code 1: 22.06 Seconds
  • Code 2: 0.03 Seconds

DataFrame with ~200k records

  • Code 1: 180.06 Seconds
  • Code 2: 0.06 Seconds

DataFrame with ~1.6 Million records

  • Code 1: code kept running endlessly
  • Code 2: 0.40 Seconds

DataFrame with ~13 Million records

  • Code 1: --did not even try, after seeing performance on 1.6 Mn records--
  • Code 2: 3.20 Seconds

Apologies for the long answer! Hope this helps!
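For reference, the same selective treatment can also be written without an explicit Python loop. This is a sketch I have not benchmarked against the loop above:

na_cols = df.columns[df.isnull().any(axis=0)]          # only the variables that contain NaN
df[na_cols] = df[na_cols].fillna(df[na_cols].mean())   # fill just those columns with their means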

– Gaurav Singh
42
In [15]: import numpy as np; import pandas as pd

In [16]: df = pd.DataFrame(np.random.randn(10, 3))

In [17]: df.iloc[3:5,0] = np.nan

In [18]: df.iloc[4:6,1] = np.nan

In [19]: df.iloc[5:8,2] = np.nan

In [20]: df
Out[20]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3       NaN -0.985188 -0.324136
4       NaN       NaN  0.238512
5  0.769657       NaN       NaN
6  0.141951  0.326064       NaN
7 -1.694475 -0.523440       NaN
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

In [22]: df.mean()
Out[22]: 
0   -0.251534
1   -0.040622
2   -0.841219
dtype: float64

Apply, per column, the mean of that column and fill:

In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622  0.238512
5  0.769657 -0.040622 -0.841219
6  0.141951  0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
– Jeff
16

If you want to impute missing values with the mean, going column by column, this imputes using only the mean of that particular column. It might also be a little more readable.

sub2['income'] = sub2['income'].fillna(sub2['income'].mean())
– Pranay Aryal
15
import numpy as np
import pandas as pd

# To read data from csv file
Dataset = pd.read_csv('Data.csv')

X = Dataset.iloc[:, :-1].values

# To calculate mean use imputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
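If you would rather keep the result as a labelled DataFrame instead of the NumPy array X, one possible sketch (it assumes the columns to impute are the numeric ones of Dataset; adjust the selection to your data):

numeric_cols = Dataset.select_dtypes(include='number').columns
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
Dataset[numeric_cols] = imputer.fit_transform(Dataset[numeric_cols])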
– Roshan jha
  • What's the advantage of all this over the far simpler alternatives? – AMC Feb 16 '20 at 04:22
  • @Roshan Jha It is always better to explain the logic. There are many ways to do the same task in R & Python; however, if you are suggesting something different, you may want to point out some advantages of doing so. – Dr Nisha Arora Mar 11 '20 at 11:41
14

Use df.fillna(df.mean()) directly to fill all the null values with the column means.

If you want to fill the null values of a single column with that column's mean (here Item_Weight is the column name):

df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())

If you want to fill the null values with some string instead (here Outlet_Size is the column name):

df.Outlet_Size = df.Outlet_Size.fillna('Missing')
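Both column-specific fills above can also be combined into a single call by passing a dict to fillna, mapping each column name to its fill value (a sketch using the same columns):

df = df.fillna({'Item_Weight': df['Item_Weight'].mean(),
                'Outlet_Size': 'Missing'})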
– Sunny Barnwal
14

Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column

Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']

If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:

Use method .fillna():

mean_value = df['nr_items'].mean()
df['nr_item_ave'] = df['nr_items'].fillna(mean_value)

I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.

You should be careful when using the mean: if you have outliers, it is more advisable to use the median.
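A median-based version of the same fill (a sketch reusing the nr_items column from above) looks like this:

median_value = df['nr_items'].median()
df['nr_item_ave'] = df['nr_items'].fillna(median_value)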

– pink.slash
10

Another option besides those above is:

df = df.groupby(df.columns, axis=1).transform(lambda x: x.fillna(x.mean()))

It's less elegant than the previous answers for the mean, but it can be handy if you want to replace nulls with some other column function.
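For example, swapping in the median only requires changing the lambda. This sketch keeps the same groupby/transform pattern and assumes a pandas version where grouping with axis=1 is still supported:

df = df.groupby(df.columns, axis=1).transform(lambda x: x.fillna(x.median()))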

– guibor
3

Using the SimpleImputer class from the sklearn library:

import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer works column-wise; unlike the old Imputer it has no axis parameter
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:, 1:3])
x[:, 1:3] = missingvalues.transform(x[:, 1:3])

Note: in recent versions of sklearn, the value of the missing_values parameter changed from 'NaN' to np.nan.

– MedoAlmasry
1

I use this method to fill missing values with the average of each column:

fill_mean = lambda col: col.fillna(col.mean())

df = df.apply(fill_mean, axis=0)
– Punit Vara
0

You can also use value_counts to fill each column with its most frequent value instead of the mean. This works across different datatypes.

df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))

Here is the value_counts api reference.
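An equivalent one-liner uses DataFrame.mode (a sketch; like the snippet above, it fills each column with its most frequent value):

df = df.fillna(df.mode().iloc[0])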

– pratsbhatt