Replacing rows having 0 with the mean/median of the column

Question

I'm trying to impute 5 columns in a dataset. None of these columns has any blanks; instead, I need to replace the rows containing 0 with the column mean/median. I tried the following 2 alternatives independently, as shown below:

from sklearn.impute import SimpleImputer
impute = SimpleImputer(missing_values=0,strategy='mean')

impute.fit_transform(train[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']])

AND

train["Glucose"].fillna(train["Glucose"].mean(), inplace=True)

To cross-check, I looked at the unique values in each column with train['Glucose'].unique() after each of the alternatives, to see whether any 0 remained after imputing. The output below still shows 0, suggesting that neither of the 2 methods worked.

Output

array([148,  85, 183,  89, 137, 116,  78, 115, 197, 125, 110, 168, 139,
       189, 166, 100, 118, 107, 103, 126,  99, 196, 119, 143, 147,  97,
       145, 117, 109, 158,  88,  92, 122, 138, 102,  90, 111, 180, 133,
       106, 171, 159, 146,  71, 105, 101, 176, 150,  73, 187,  84,  44,
       141, 114,  95, 129,  79,   **0**,  62, 131, 112, 113,  74,  83, 136,
        80, 123,  81, 134, 142, 144,  93, 163, 151,  96, 155,  76, 160,
       124, 162, 132, 120, 173, 170, 128, 108, 154,  57, 156, 153, 188,
       152, 104,  87,  75, 179, 130, 194, 181, 135, 184, 140, 177, 164,
        91, 165,  86, 193, 191, 161, 167,  77, 182, 157, 178,  61,  98,
       127,  82,  72, 172,  94, 175, 195,  68, 186, 198, 121,  67, 174,
       199,  56, 169, 149,  65, 190], dtype=int64)

I would really appreciate it if someone could point out where my code is wrong or suggest another way to impute.

The question marked as already answered does not specifically answer my question. I do not want to replace blanks with 0; I want to replace 0 with the mean. Request @jezrael to re-open the post. I spent the past 30 mins checking but found nothing that helps with my query. — Sid, Dec 20 '19 at 12:23
Yes, I went through the suggested questions again. They all try to replace NaN with 0, and I think my query is slightly different. I may be making a conceptual error. — Sid, Dec 20 '19 at 12:27

jezrael · Answer 1 · 2019-12-20T13:02:46.123

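If you want to replace 0 with the column means, your first solution works fine for me, as long as the array returned by fit_transform is assigned to something (it does not modify train in place). The second solution needs one change: replace 0 with NaN first, and then it is possible to use fillna: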

import numpy as np
import pandas as pd

np.random.seed(42)
columns = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
train = pd.DataFrame(np.random.randint(5, size=(10,5)), columns=columns)
print (train)
   Glucose  BloodPressure  SkinThickness  Insulin  BMI
0        3              4              2        4    4
1        1              2              2        2    4
2        3              2              4        1    3
3        1              3              4        0    3
4        1              4              3        0    0
5        2              2              1        3    3
6        2              3              3        0    2
7        4              2              4        0    1
8        3              0              3        1    1
9        0              1              4        1    3

from sklearn.impute import SimpleImputer
impute = SimpleImputer(missing_values=0,strategy='mean')

c = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df1 = pd.DataFrame(impute.fit_transform(train[c]), columns=c)
print (df1)
    Glucose  BloodPressure  SkinThickness  Insulin       BMI
0  3.000000       4.000000            2.0      4.0  4.000000
1  1.000000       2.000000            2.0      2.0  4.000000
2  3.000000       2.000000            4.0      1.0  3.000000
3  1.000000       3.000000            4.0      2.0  3.000000
4  1.000000       4.000000            3.0      2.0  2.666667
5  2.000000       2.000000            1.0      3.0  3.000000
6  2.000000       3.000000            3.0      2.0  2.000000
7  4.000000       2.000000            4.0      2.0  1.000000
8  3.000000       2.555556            3.0      1.0  1.000000
9  2.222222       1.000000            4.0      1.0  3.000000

df2 = train.mask(train == 0)
df2 = df2.fillna(df2.mean())
print (df2)
    Glucose  BloodPressure  SkinThickness  Insulin       BMI
0  3.000000       4.000000              2      4.0  4.000000
1  1.000000       2.000000              2      2.0  4.000000
2  3.000000       2.000000              4      1.0  3.000000
3  1.000000       3.000000              4      2.0  3.000000
4  1.000000       4.000000              3      2.0  2.666667
5  2.000000       2.000000              1      3.0  3.000000
6  2.000000       3.000000              3      2.0  2.000000
7  4.000000       2.000000              4      2.0  1.000000
8  3.000000       2.555556              3      1.0  1.000000
9  2.222222       1.000000              4      1.0  3.000000
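
As for why the zeros were still visible in the question, a likely explanation is that fit_transform returns a new array that was never assigned back, and that fillna only touches NaN, not 0. The sketch below illustrates both points on a small made-up frame (the values and the variable names such as imputed and fixed are assumptions for illustration, not the asker's data):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not the asker's dataset
train = pd.DataFrame({"Glucose": [148, 0, 183, 89],
                      "BMI": [33.6, 26.6, 0.0, 28.1]})
cols = ["Glucose", "BMI"]

# fit_transform returns a new array; train itself is left untouched
imputed = SimpleImputer(missing_values=0, strategy="mean").fit_transform(train[cols])
zeros_after_unassigned_call = int((train[cols] == 0).sum().sum())

# fillna only replaces NaN, so a literal 0 survives
after_fillna = train["Glucose"].fillna(train["Glucose"].mean())
zeros_after_fillna = int((after_fillna == 0).sum())

# Assigning the imputed array back is what actually removes the zeros
fixed = train.copy()
fixed[cols] = imputed
zeros_after_assignment = int((fixed[cols] == 0).sum().sum())

print(zeros_after_unassigned_call, zeros_after_fillna, zeros_after_assignment)
```

Checking train['Glucose'].unique() right after an unassigned fit_transform call would therefore still show 0, even though the imputer itself did its job.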

EDIT: Solution if the DataFrame has more columns which should not be imputed:

np.random.seed(42)
columns = ['col1','col2','Glucose','BloodPressure','SkinThickness','Insulin','BMI']
train = pd.DataFrame(np.random.randint(5, size=(10,7)), columns=columns)
print (train)
   col1  col2  Glucose  BloodPressure  SkinThickness  Insulin  BMI
0     3     4        2              4              4        1    2
1     2     2        4              3              2        4    1
2     3     1        3              4              0        3    1
3     4     3        0              0              2        2    1
4     3     3        2              3              3        0    2
5     4     2        4              0              1        3    0
6     3     1        1              0              1        4    1
7     3     3        3              3              4        2    0
8     3     1        3              1              1        3    4
9     1     1        3              1              1        3    3

from sklearn.impute import SimpleImputer
impute = SimpleImputer(missing_values=0,strategy='mean')

c = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
train[c] = impute.fit_transform(train[c])
print (train)
   col1  col2   Glucose  BloodPressure  SkinThickness   Insulin    BMI
0     3     4  2.000000       4.000000       4.000000  1.000000  2.000
1     2     2  4.000000       3.000000       2.000000  4.000000  1.000
2     3     1  3.000000       4.000000       2.111111  3.000000  1.000
3     4     3  2.777778       2.714286       2.000000  2.000000  1.000
4     3     3  2.000000       3.000000       3.000000  2.777778  2.000
5     4     2  4.000000       2.714286       1.000000  3.000000  1.875
6     3     1  1.000000       2.714286       1.000000  4.000000  1.000
7     3     3  3.000000       3.000000       4.000000  2.000000  1.875
8     3     1  3.000000       1.000000       1.000000  3.000000  4.000
9     1     1  3.000000       1.000000       1.000000  3.000000  3.000

c = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df2 = train[c].mask(train[c] == 0)
train[c] = df2.fillna(df2.mean())
print (train)
   col1  col2   Glucose  BloodPressure  SkinThickness   Insulin    BMI
0     3     4  2.000000       4.000000       4.000000  1.000000  2.000
1     2     2  4.000000       3.000000       2.000000  4.000000  1.000
2     3     1  3.000000       4.000000       2.111111  3.000000  1.000
3     4     3  2.777778       2.714286       2.000000  2.000000  1.000
4     3     3  2.000000       3.000000       3.000000  2.777778  2.000
5     4     2  4.000000       2.714286       1.000000  3.000000  1.875
6     3     1  1.000000       2.714286       1.000000  4.000000  1.000
7     3     3  3.000000       3.000000       4.000000  2.000000  1.875
8     3     1  3.000000       1.000000       1.000000  3.000000  4.000
9     1     1  3.000000       1.000000       1.000000  3.000000  3.000
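
The question also mentions the median. A minimal sketch of the same two approaches with the median instead of the mean follows; the toy frame, the Age column and the names by_imputer and by_pandas are assumptions made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; 'Age' stands in for a column that must not be imputed
train = pd.DataFrame({
    "Age": [50, 31, 32, 21, 33],
    "Glucose": [148, 0, 183, 89, 137],
    "Insulin": [0, 0, 94, 168, 88],
})
c = ["Glucose", "Insulin"]

# Option 1: SimpleImputer treating 0 as missing, filling with the column median
by_imputer = train.copy()
by_imputer[c] = SimpleImputer(missing_values=0, strategy="median").fit_transform(train[c])

# Option 2: pandas only, turning 0 into NaN and filling with the median of the rest
by_pandas = train.copy()
masked = train[c].mask(train[c] == 0)
by_pandas[c] = masked.fillna(masked.median())

print(by_imputer)
print(by_pandas)
```

Both options compute the median over the non-zero values only, so they should agree with each other, and the columns outside c are left as they were.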

The above suggestion creates a separate DataFrame, right? However, I would prefer to do this without creating a new DataFrame, because I have other columns apart from the ones I'm trying to impute that help me predict the outcome. The columns in the data are Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object'), and I'm only trying to impute Glucose, BloodPressure, SkinThickness, Insulin and BMI. Apologies for failing to mention this before. — Sid, Dec 20 '19 at 12:54
The second alternative gave me the required solution, thank you! I will work on the previous code. — Sid, Dec 20 '19 at 12:59
