1

I am trying to access multiple Pandas DataFrame columns of the Pima Indians Diabetes Dataset (those ones listed in cols_missing) and replace existing zeros with np.NaN.

# Replace missing values (meaningless 0s) in df with NaN
cols_missing_vals = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']]
df = cols_missing_vals.replace(to_replace=0, value=np.NaN, inplace=False)
# Fill NaN values with mean imputation
df = df.fillna(value=df.mean(), inplace=False)

When I try to operate with inplace=False in order to avoid chained assignments and reassign the variable, the operations are performed but the three columns where no replace operation takes place are dropped.

    Glucose BloodPressure   SkinThickness   Insulin BMI Age
0   148.0   72.0    35.00000    155.548223  33.6    50
1   85.0    66.0    29.00000    155.548223  26.6    31
2   183.0   64.0    29.15342    155.548223  23.3    32
3   89.0    66.0    23.00000    94.000000   28.1    21
4   137.0   40.0    35.00000    168.000000  43.1    33
... ... ... ... ... ... ...
763 101.0   76.0    48.00000    180.000000  32.9    63
764 122.0   70.0    27.00000    155.548223  36.8    27
765 121.0   72.0    23.00000    112.000000  26.2    30
766 126.0   60.0    29.15342    155.548223  30.1    47
767 93.0    70.0    31.00000    155.548223  30.4    23

However, once I use inplace=True and skip reassigning the variable, no replacement operation takes place at all.

cols_missing_vals = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']]
cols_missing_vals.replace(to_replace=0, value=np.NaN, inplace=True)
# Fill NaN values with mean imputation
df.fillna(value=df.mean(), inplace=True)
    Pregnancies Glucose BloodPressure   SkinThickness   Insulin BMI DiabetesPedigreeFunction    Age Outcome
0   6   148 72  35  0   33.6    0.627   50  1
1   1   85  66  29  0   26.6    0.351   31  0
2   8   183 64  0   0   23.3    0.672   32  1
3   1   89  66  23  94  28.1    0.167   21  0
4   0   137 40  35  168 43.1    2.288   33  1
... ... ... ... ... ... ... ... ... ...
763 10  101 76  48  180 32.9    0.171   63  0
764 2   122 70  27  0   36.8    0.340   27  0
765 5   121 72  23  112 26.2    0.245   30  0
766 1   126 60  0   0   30.1    0.349   47  1
767 1   93  70  31  0   30.4    0.315   23  0

How can I ensure to keep the entire initial df, but with successful mean imputation for those columns where it is useful? E.g., while skin thickness can't be 0 and which indicates a missing value, pregnancies is totally okay with being 0, however I want to keep both of them in the df. Thanks!

29nivek
  • 35
  • 7

2 Answers2

2

Try this, as per this answer:

cols_missing_vals = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']
df[cols_missing_vals] = df[cols_missing_vals].replace(to_replace=0, value=np.NaN)
df[cols_missing_vals] = df[cols_missing_vals].fillna(value=df.mean())
ljdyer
  • 1,946
  • 1
  • 3
  • 11
  • Thank you for your answer Ijdyer! I think that's the most concise way to do it, and I was not aware that newly assigning the variable will lead to the dataframe being filled with the altered columns. – 29nivek Oct 17 '22 at 11:12
1

The scikit learn package in python has functionality for mean imputation.

I first created a toy pandas dataframe to illustrate:

from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

columns =     ["Pregnancies", "Glucose", "BloodPressure",   "SkinThickness",   "Insulin", "BM", "DiabetesPedigreeFunction",    "Age", "Outcome"]
data = np.random.randint(low=0, high=10, size = (10, len(columns)))
data = pd.DataFrame(data = data, columns = columns)
data
>       Pregnancies Glucose BloodPressure   SkinThickness   Insulin BM  DiabetesPedigreeFunction    Age Outcome
>     0 7   7   7   6   8   0   9   0   0
>     1 7   9   0   0   0   2   5   8   8
>     2 3   3   6   3   0   9   9   0   7
>     3 5   1   4   6   8   7   2   8   6
>     4 7   4   4   7   7   8   7   1   4
>     5 8   8   2   7   4   5   0   5   9
>     6 7   0   3   4   9   6   0   9   3
>     7 6   4   0   8   5   0   1   2   5
>     8 6   1   8   3   6   0   2   0   2
>     9 3   0   1   8   0   3   3   0   6



Here, I apply that functionality to impute the 0 values (which are your missing_values).

impute_mean = SimpleImputer(missing_values = 0, strategy = "mean")
data = pd.DataFrame(data = impute_mean.fit_transform(data), columns = columns)
data
>   Pregnancies Glucose BloodPressure   SkinThickness   Insulin BM  DiabetesPedigreeFunction    Age Outcome
> 0 7.0 7.000   7.000   6.000000    8.000000    5.714286    9.00    5.5 5.555556
> 1 7.0 9.000   4.375   5.777778    6.714286    2.000000    5.00    8.0 8.000000
> 2 3.0 3.000   6.000   3.000000    6.714286    9.000000    9.00    5.5 7.000000
> 3 5.0 1.000   4.000   6.000000    8.000000    7.000000    2.00    8.0 6.000000
> 4 7.0 4.000   4.000   7.000000    7.000000    8.000000    7.00    1.0 4.000000
> 5 8.0 8.000   2.000   7.000000    4.000000    5.000000    4.75    5.0 9.000000
> 6 7.0 4.625   3.000   4.000000    9.000000    6.000000    4.75    9.0 3.000000
> 7 6.0 4.000   4.375   8.000000    5.000000    5.714286    1.00    2.0 5.000000
> 8 6.0 1.000   8.000   3.000000    6.000000    5.714286    2.00    5.5 2.000000
> 9 3.0 4.625   1.000   8.000000    6.714286    3.000000    3.00    5.5 6.000000

Snehal Patel
  • 192
  • 10
  • Thank you Snehal! I forgot about SimpleImputer and this is an elegant way to do it. In case there are only a subset of the columns that you want to impute (such as Glucose, SkinThickness etc. but not Pregnancies), would you then just use the following? `cols_missing_vals = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age'] data = pd.DataFrame(data = impute_mean.fit_transform(data), columns = df[cols_missing_vals])` – 29nivek Oct 17 '22 at 11:18
  • Sry, I mean: `cols_missing_vals = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age'] data = pd.DataFrame(data = impute_mean.fit_transform(data[cols_missing_vals]), columns = [cols_missing_vals])` – 29nivek Oct 17 '22 at 11:25
  • 1
    Hi @29nivek, yes, that is exactly right. You can simply subset the columns that you want and then do the imputation. However, remember that SimpleImputer remove the column names, so you have to add them back through the pandas dataframe, as I had shown: `data = pd.DataFrame(data = impute_mean.fit_transform(data), columns = columns)`. This means that you will need to take into consideration which columns you subset on. – Snehal Patel Oct 17 '22 at 13:56