I'm trying to impute 5 columns in a dataset. None of the columns contain blanks, so I need to replace the rows that have 0 with the mean/median. I tried the following 2 alternatives independently, as shown below:

from sklearn.impute import SimpleImputer
impute = SimpleImputer(missing_values=0,strategy='mean')

impute.fit_transform(train[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']])

AND

train["Glucose"].fillna(train["Glucose"].mean(), inplace=True)

To cross-check, I called train['Glucose'].unique() after each alternative to see whether any 0 remained after imputing. The output still shows 0, as below, suggesting that both methods failed to work.

Output

array([148,  85, 183,  89, 137, 116,  78, 115, 197, 125, 110, 168, 139,
       189, 166, 100, 118, 107, 103, 126,  99, 196, 119, 143, 147,  97,
       145, 117, 109, 158,  88,  92, 122, 138, 102,  90, 111, 180, 133,
       106, 171, 159, 146,  71, 105, 101, 176, 150,  73, 187,  84,  44,
       141, 114,  95, 129,  79,   **0**,  62, 131, 112, 113,  74,  83, 136,
        80, 123,  81, 134, 142, 144,  93, 163, 151,  96, 155,  76, 160,
       124, 162, 132, 120, 173, 170, 128, 108, 154,  57, 156, 153, 188,
       152, 104,  87,  75, 179, 130, 194, 181, 135, 184, 140, 177, 164,
        91, 165,  86, 193, 191, 161, 167,  77, 182, 157, 178,  61,  98,
       127,  82,  72, 172,  94, 175, 195,  68, 186, 198, 121,  67, 174,
       199,  56, 169, 149,  65, 190], dtype=int64)

I would really appreciate it if someone could point out where my code is wrong, or suggest another way to impute.

Sid
  • The question marked as already answered does not specifically answer my question. I do not want to replace blanks with 0; I want to replace 0 with the mean. Request @jezrael to re-open the post. I spent the past 30 minutes checking but found nothing that helps with my query. – Sid Dec 20 '19 at 12:23
  • So both solutions failed? – jezrael Dec 20 '19 at 12:24
  • Yes, I went through the suggested questions again; they all try to replace NaN with 0, and I think my query is slightly different. I may be making a conceptual error. – Sid Dec 20 '19 at 12:27

1 Answer

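If you want to replace 0 with the column means, your first solution works fine for me, provided the array returned by fit_transform is assigned to something (it does not modify train in place). The second solution needs a change: fillna only fills NaN, so replace 0 with NaN first and then use fillna: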

import numpy as np
import pandas as pd

np.random.seed(42)
columns = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
train = pd.DataFrame(np.random.randint(5, size=(10,5)), columns=columns)
print (train)
   Glucose  BloodPressure  SkinThickness  Insulin  BMI
0        3              4              2        4    4
1        1              2              2        2    4
2        3              2              4        1    3
3        1              3              4        0    3
4        1              4              3        0    0
5        2              2              1        3    3
6        2              3              3        0    2
7        4              2              4        0    1
8        3              0              3        1    1
9        0              1              4        1    3

from sklearn.impute import SimpleImputer
impute = SimpleImputer(missing_values=0,strategy='mean')

c = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df1 = pd.DataFrame(impute.fit_transform(train[c]), columns=c)
print (df1)
    Glucose  BloodPressure  SkinThickness  Insulin       BMI
0  3.000000       4.000000            2.0      4.0  4.000000
1  1.000000       2.000000            2.0      2.0  4.000000
2  3.000000       2.000000            4.0      1.0  3.000000
3  1.000000       3.000000            4.0      2.0  3.000000
4  1.000000       4.000000            3.0      2.0  2.666667
5  2.000000       2.000000            1.0      3.0  3.000000
6  2.000000       3.000000            3.0      2.0  2.000000
7  4.000000       2.000000            4.0      2.0  1.000000
8  3.000000       2.555556            3.0      1.0  1.000000
9  2.222222       1.000000            4.0      1.0  3.000000

df2 = train.mask(train == 0)
df2 = df2.fillna(df2.mean())
print (df2)
    Glucose  BloodPressure  SkinThickness  Insulin       BMI
0  3.000000       4.000000              2      4.0  4.000000
1  1.000000       2.000000              2      2.0  4.000000
2  3.000000       2.000000              4      1.0  3.000000
3  1.000000       3.000000              4      2.0  3.000000
4  1.000000       4.000000              3      2.0  2.666667
5  2.000000       2.000000              1      3.0  3.000000
6  2.000000       3.000000              3      2.0  2.000000
7  4.000000       2.000000              4      2.0  1.000000
8  3.000000       2.555556              3      1.0  1.000000
9  2.222222       1.000000              4      1.0  3.000000
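
To see why the original fillna call left the zeros in place, here is a minimal sketch on a single made-up column (the values are illustrative and not taken from the real dataset). It assumes only that the zeros are true integer or float zeros rather than strings:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
s = pd.Series([148, 85, 0, 89, 0], name="Glucose")

# fillna only replaces NaN, so the zeros survive unchanged.
unchanged = s.fillna(s.mean())

# Convert 0 to NaN first; mean() skips NaN, and fillna then fills the gaps.
masked = s.replace(0, np.nan)
fixed = masked.fillna(masked.mean())

print((unchanged == 0).sum())  # number of zeros left by the original approach
print((fixed == 0).sum())      # number of zeros left after masking first
```

Note that the mean used for filling is computed after masking, so the zeros do not drag it down.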

EDIT: Solution if the DataFrame has more columns that must not be imputed:

np.random.seed(42)
columns = ['col1','col2','Glucose','BloodPressure','SkinThickness','Insulin','BMI']
train = pd.DataFrame(np.random.randint(5, size=(10,7)), columns=columns)
print (train)
   col1  col2  Glucose  BloodPressure  SkinThickness  Insulin  BMI
0     3     4        2              4              4        1    2
1     2     2        4              3              2        4    1
2     3     1        3              4              0        3    1
3     4     3        0              0              2        2    1
4     3     3        2              3              3        0    2
5     4     2        4              0              1        3    0
6     3     1        1              0              1        4    1
7     3     3        3              3              4        2    0
8     3     1        3              1              1        3    4
9     1     1        3              1              1        3    3

from sklearn.impute import SimpleImputer
impute = SimpleImputer(missing_values=0,strategy='mean')

c = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
train[c] = impute.fit_transform(train[c])
print (train)
   col1  col2   Glucose  BloodPressure  SkinThickness   Insulin    BMI
0     3     4  2.000000       4.000000       4.000000  1.000000  2.000
1     2     2  4.000000       3.000000       2.000000  4.000000  1.000
2     3     1  3.000000       4.000000       2.111111  3.000000  1.000
3     4     3  2.777778       2.714286       2.000000  2.000000  1.000
4     3     3  2.000000       3.000000       3.000000  2.777778  2.000
5     4     2  4.000000       2.714286       1.000000  3.000000  1.875
6     3     1  1.000000       2.714286       1.000000  4.000000  1.000
7     3     3  3.000000       3.000000       4.000000  2.000000  1.875
8     3     1  3.000000       1.000000       1.000000  3.000000  4.000
9     1     1  3.000000       1.000000       1.000000  3.000000  3.000

c = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df2 = train[c].mask(train[c] == 0)
train[c] = df2.fillna(df2.mean())
print (train)
   col1  col2   Glucose  BloodPressure  SkinThickness   Insulin    BMI
0     3     4  2.000000       4.000000       4.000000  1.000000  2.000
1     2     2  4.000000       3.000000       2.000000  4.000000  1.000
2     3     1  3.000000       4.000000       2.111111  3.000000  1.000
3     4     3  2.777778       2.714286       2.000000  2.000000  1.000
4     3     3  2.000000       3.000000       3.000000  2.777778  2.000
5     4     2  4.000000       2.714286       1.000000  3.000000  1.875
6     3     1  1.000000       2.714286       1.000000  4.000000  1.000
7     3     3  3.000000       3.000000       4.000000  2.000000  1.875
8     3     1  3.000000       1.000000       1.000000  3.000000  4.000
9     1     1  3.000000       1.000000       1.000000  3.000000  3.000
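
Since the question also mentions the median, the same masking approach can be sketched with median() in place of mean(), followed by a check that no zeros remain in the imputed columns. The frame below is a hypothetical toy example; 'Age' simply stands in for any column that should stay untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
train = pd.DataFrame({
    "Age": [50, 31, 0, 21],            # not imputed, its 0 must be kept
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

c = ["Glucose", "BMI"]                     # columns to impute
masked = train[c].mask(train[c] == 0)      # 0 -> NaN in the selected columns only
train[c] = masked.fillna(masked.median())  # fill with each column's median

print(train)
print((train[c] == 0).any().any())         # True if any zero is left in c
```

With SimpleImputer, passing strategy='median' instead of strategy='mean' should give the equivalent result.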
jezrael
  • The above suggestion creates a separate DataFrame, right? However, I prefer to do this without creating a new DataFrame, because I have other columns, apart from the ones I'm trying to impute, that help me predict the outcome. The columns in the data are Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object'), and I'm only trying to impute Glucose, BloodPressure, SkinThickness, Insulin and BMI. Apologies for failing to mention this before. – Sid Dec 20 '19 at 12:54
  • The second alternative gave me the required solution, thank you! I will work on the previous code. – Sid Dec 20 '19 at 12:59
  • @Sid - The solutions have been modified for the new requirement. – jezrael Dec 20 '19 at 13:03