sklearn Imputer() returned features does not fit in fit function

Question

I have a feature matrix with missing values (NaNs), so I need to impute those missing values first. However, the last line fails with the following error: `Expected sequence or array-like, got Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)`. I checked, and the reason seems to be that train_fea_imputed is not an np.array but a sklearn.preprocessing.imputation.Imputer object. How should I fix this?
BTW, if I use train_fea_imputed = imp.fit_transform(train_fea), the code works fine, but train_fea_imputed is returned as an array with one dimension fewer than train_fea.

    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import Imputer

    imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
    train_fea_imputed = imp.fit(train_fea)

    # train_fea_imputed = imp.fit_transform(train_fea)
    rf = RandomForestClassifier(n_estimators=5000,n_jobs=1, min_samples_leaf = 3)
    rf.fit(train_fea_imputed, train_label)

Update: I changed to

    imp = Imputer(missing_values='NaN', strategy='mean', axis=1)

and now the dimension problem no longer occurs. I think there are some inherent issues in the imputing function. I will come back when I finish the project.

Can you make a sample `train_fea` and `train_label` with dummy values so I can run this on my computer? e.g. like this http://stackoverflow.com/a/30319249/238639 — bakkal, Jun 01 '15 at 22:56
@bakkal I think you can use numpy to generate random matrix and mask them with NaN to try. — Jin, Jun 01 '15 at 23:12

maxymoo · Accepted Answer · 2015-06-01T23:19:02.377

4

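With scikit-learn, initialising the model, training the model and getting the predictions are separate steps. Note that `imp.fit(train_fea)` returns the fitted `Imputer` object itself, not the imputed data, which is why `rf.fit` complains when you pass it the result. In your case you have: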

    import numpy as np
    from sklearn.preprocessing import Imputer

    train_fea = np.array([[1,1,0],[0,0,1],[1,np.nan,0]])
    train_fea
    # array([[  1.,   1.,   0.],
    #        [  0.,   0.,   1.],
    #        [  1.,  nan,   0.]])

    # initialise the model
    imp = Imputer(missing_values='NaN', strategy='mean', axis=0)

    # train the model (learns the mean of each column)
    imp.fit(train_fea)

    # get the predictions (the NaN is replaced by its column mean)
    train_fea_imputed = imp.transform(train_fea)
    train_fea_imputed
    # array([[ 1. ,  1. ,  0. ],
    #        [ 0. ,  0. ,  1. ],
    #        [ 1. ,  0.5,  0. ]])
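For readers on a recent scikit-learn, here is a minimal sketch of the same three steps. It assumes scikit-learn 0.20 or later, where `SimpleImputer` from `sklearn.impute` took over from `Imputer` (which was removed in 0.22); the variable names `fitted` and `same_result` are only illustrative. It shows that `fit` hands back the imputer object, while `transform` and `fit_transform` hand back the array you can pass on to a classifier.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # assumption: scikit-learn >= 0.20

train_fea = np.array([[1, 1, 0], [0, 0, 1], [1, np.nan, 0]], dtype=float)

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() learns the per-column means and returns the imputer itself, not data
fitted = imp.fit(train_fea)
print(fitted is imp)

# transform() returns the imputed NumPy array
train_fea_imputed = imp.transform(train_fea)
print(type(train_fea_imputed).__name__, train_fea_imputed.shape)

# fit_transform() performs both steps in a single call
same_result = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(train_fea)
print(np.array_equal(same_result, train_fea_imputed))
```

So the fix for the question's code is to pass the output of `transform` (or `fit_transform`) to `rf.fit`, not the return value of `fit`.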

edited Jun 01 '15 at 23:19

answered Jun 01 '15 at 22:58

maxymoo


Your solution indeed returns train_fea_imputed with one dimension fewer than my original train_fea. I think it is equivalent to applying fit_transform directly. My problem is that I need to apply feature_score later in the random forest. If I have fewer dimensions, I have no way to associate the scores with the feature names. – Jin Jun 01 '15 at 23:08
I've added an example matrix to my answer. I don't seem to be having problems with dimensionality? – maxymoo Jun 01 '15 at 23:17
I changed the Imputer parameter to axis = 1 and now the problem is fixed. I think my train_fea has some weird columns. – Jin Jun 01 '15 at 23:29

score 1 · Answer 2 · answered Apr 12 '16 at 03:09


I think that axis = 1 is not correct in this case, since you want to take the mean over the values of each feature vector/column (axis = 0) and not over each row (axis = 1).
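One possible explanation for the shrinking column count with axis = 0, assuming `train_fea` contains a column that is entirely NaN: a column with no observed values has no mean, and the imputer discards such columns on transform. The sketch below reproduces this with the modern `SimpleImputer` (scikit-learn 0.20 or later, which always works column-wise), using a made-up matrix and hypothetical feature names `a`, `b`, `c`. It also shows one way to recover which feature names survive, via the documented `statistics_` attribute, so importance scores can still be matched to names.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # assumption: scikit-learn >= 0.20

# Hypothetical feature names and data; column "b" has no observed values
feature_names = np.array(["a", "b", "c"])
X = np.array([[1.0, np.nan, 0.0],
              [0.0, np.nan, 1.0],
              [1.0, np.nan, np.nan]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imp.fit_transform(X)  # recent versions may warn about the empty column

# The all-NaN column is dropped, so the result has one column fewer
print(X.shape, X_imputed.shape)

# statistics_ holds the fill value per input column; NaN marks a dropped column
kept = ~np.isnan(imp.statistics_)
kept_names = feature_names[kept]
print(list(kept_names))
```

Switching to axis = 1 in the old `Imputer` avoids the drop only because it fills each gap with the mean of its row, which mixes values from different features; removing or separately handling the empty column keeps the per-feature semantics.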

answered Apr 12 '16 at 03:09

Ankit

