I am trying to use OneHotEncoder on categorical values.

However, it is failing with the error below. What could be going wrong? Please help; any comments are always welcome.

Below is the code snippet:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
print(X.shape)
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
print(X)
print(X.shape)
print(y)
#X = X.reshape(len(X[:, 0]), 7)
print(X.shape)
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
print(X.shape)
print(X)

The output of the code is below. It looks like the issue is with array formatting.

I am getting the following output:
(17, 7)
[[2 0 0 'Offline' 'Low' 'Cold' 'No']
 [0 0 0 'Offline' 'High' 'Cold' 'No']
 [3 0 1 'Online' 'High' 'Cold' 'Yes']
 [2 0 1 'Offline' 'Low' 'Hot' 'Yes']
 [2 0 1 'Offline' 'High' 'Hot' 'Yes']
 [2 0 0 'Online' 'High' 'Cold' 'Yes']
 [2 1 1 'Offline' 'Low' 'Hot' 'No']
 [2 1 0 'Offline' 'Low' 'Cold' 'No']
 [0 1 0 'Online' 'Low' 'Cold' 'Yes']
 [3 1 1 'Online' 'Low' 'Hot' 'Yes']
 [1 1 0 'Offline' 'Low' 'Hot' 'No']
 [2 1 1 'Offline' 'Low' 'Hot' 'Yes']
 [3 1 1 'Online' 'High' 'Hot' 'Yes']
 [2 1 0 'Online' 'High' 'Hot' 'No']
 [2 2 2 'Offline' 'Low' 'Hot' 'Yes']
 [2 2 1 'Offline' 'Low' 'Cold' 'No']
 [1 2 0 'Offline' 'High' 'Cold' 'Yes']]
(17, 7)
['Low' 'Low' 'High' 'High' 'High' 'Low' 'Low' 'Low' 'Low' 'High' 'Low'
 'High' 'High' 'High' 'High' 'Low' 'Low']
(17, 7)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-84bec98371d4> in <module>()
     28 print(X.shape)
     29 onehotencoder = OneHotEncoder(categorical_features = [0])
---> 30 X = onehotencoder.fit_transform(X).toarray()
     31 print(X.shape)
     32 print(X)

C:\Users\patilsi\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\preprocessing\data.py in fit_transform(self, X, y)
   2017         """
   2018         return _transform_selected(X, self._fit_transform,
-> 2019                                    self.categorical_features, copy=True)
   2020 
   2021     def _transform(self, X):

C:\Users\patilsi\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\preprocessing\data.py in _transform_selected(X, transform, selected, copy)
   1807     X : array or sparse matrix, shape=(n_samples, n_features_new)
   1808     """
-> 1809     X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
   1810 
   1811     if isinstance(selected, six.string_types) and selected == "all":

C:\Users\patilsi\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    431                                       force_all_finite)
    432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:

ValueError: could not convert string to float: 'Yes'
Raj
  • You should pay more attention to formatting. There is a preview area under the text editor when you write/edit a question. Good question formatting makes reading and understanding questions a lot easier. – Shai Jul 29 '18 at 12:08

2 Answers

You should apply your OneHotEncoder only to the columns you want, like this:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

onehotencoder = OneHotEncoder()
# OneHotEncoder expects 2-D input, so select each column as X[:, [i]] rather than X[:, i]
X_0 = onehotencoder.fit_transform(X[:, [0]]).toarray()
X_1 = onehotencoder.fit_transform(X[:, [1]]).toarray()

This will return two matrices with the same number of rows as X and a number of columns equal to the number of distinct values in X[:, 0] or X[:, 1], respectively.

Afterwards you are free to merge the matrices or do whatever you need. If you want to know which column corresponds to a specific category, you can access onehotencoder.feature_indices_ (categories_ in scikit-learn 0.20 and later), but because the same OHE is reused for both columns, you will lose that information for feature X0.
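To keep the category information for both features, one option is to fit a separate encoder per column. The following is only a minimal sketch: the small integer array X is a made-up stand-in for the label-encoded columns, the names enc_0 and enc_1 are hypothetical, and the categories_ attribute assumes scikit-learn 0.20 or later.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the two label-encoded columns of X
X = np.array([[2, 0],
              [0, 1],
              [3, 1],
              [2, 2]])

# One encoder per column, so neither fit overwrites the other
enc_0 = OneHotEncoder()
enc_1 = OneHotEncoder()

# X[:, [i]] keeps the column 2-D, which the encoder requires
X_0 = enc_0.fit_transform(X[:, [0]]).toarray()
X_1 = enc_1.fit_transform(X[:, [1]]).toarray()

# Each encoder remembers the categories it saw for its own column
print(enc_0.categories_[0])
print(enc_1.categories_[0])
```

Each output column of X_0 then lines up with the entry at the same position in enc_0.categories_[0], and likewise for X_1.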

I hope it helps,

Nicolas M.
  • Not really. I had written a sample program which could take X with its rows and columns intact and still encode it. I am specifically asking why it is complaining about "ValueError: could not convert string to float: 'Yes'" – Raj Jul 29 '18 at 16:44
  •
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    labelencoder_X = LabelEncoder()
    X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
    print(X[:, 0])
    #print(X)
    X = X.reshape(len(X[:, 0]), 3)
    print(y)
    print(X)
    onehotencoder = OneHotEncoder(categorical_features = [0])
    X = onehotencoder.fit_transform(X).toarray()
    #onehotencoder.shape
    print(X)
    – Raj Jul 29 '18 at 16:45
  • I checked the error and found this SO question: https://stackoverflow.com/questions/43588679/issue-with-onehotencoder-for-categorical-features. It seems that I was wrong: we need to use LabelEncoder first. I usually prepare data with pandas' get_dummies(), so now I know why ^^ – Nicolas M. Jul 30 '18 at 13:53

Even if you specify categorical_features = [0], OneHotEncoder still validates the data in all columns for compatibility with scikit-learn, and hence throws an error when any other column contains string data.

So what you can do here is pass in only the data that you want to dummy encode:
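The traceback in the question shows that the failing call is np.array(array, dtype=dtype, ...) inside check_array, with a float dtype. A minimal sketch that reproduces the same kind of failure with NumPy alone (the tiny array below is a made-up stand-in for X, not the asker's data):

```python
import numpy as np

# Hypothetical mixed array: two numeric columns followed by string columns
X = np.array([[2, 0, "Offline", "Yes"],
              [0, 1, "Online", "No"]], dtype=object)

# Converting the whole array to float fails as soon as a string is reached
try:
    np.array(X, dtype=np.float64)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)

# Converting only the numeric columns succeeds
numeric = np.array(X[:, :2], dtype=np.float64)
print(numeric.shape)
```

This suggests the error is about the remaining string columns rather than about the column being encoded.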

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
print(X.shape)
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
print(X)
print(X.shape)
print(y)
#X = X.reshape(len(X[:, 0]), 7)
print(X.shape)

onehotencoder = OneHotEncoder()

categorical_features = [0]
# Send only the first column to onehotencoder
X_oneHotEncoded = onehotencoder.fit_transform(X[:, categorical_features]).toarray()

# Combine the two arrays back together
X_final = np.hstack((X_oneHotEncoded, X[:,1:]))
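
For a version that runs on its own, here is a small sketch of the same idea on made-up data. The array X_demo and its values are hypothetical, and the cast to int is an assumption made so that the column is numeric regardless of the scikit-learn version in use.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 0 is already label-encoded, the rest are still strings
X_demo = np.array([[2, "Offline", "No"],
                   [0, "Online", "Yes"],
                   [3, "Online", "Yes"],
                   [2, "Offline", "No"]], dtype=object)

# Pass only the first column to the encoder, as a 2-D integer array
first_col = X_demo[:, [0]].astype(int)
encoded = OneHotEncoder().fit_transform(first_col).toarray()

# Stack the encoded columns next to the untouched remaining columns
X_demo_final = np.hstack((encoded, X_demo[:, 1:]))
print(X_demo_final.shape)
print(X_demo_final.dtype)
```

Note that the combined array has dtype object because it still contains strings, so the remaining string columns would need their own encoding before being passed to an estimator.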
Vivek Kumar