34

I want to encode 3 categorical features out of 10 features in my dataset. I use preprocessing from sklearn.preprocessing to do so as follows:

from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
enc.fit(dataset.values)

However, I couldn't proceed as I am getting this error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: PG

I am surprised that it is complaining about the string, since it is supposed to convert it! Am I missing something here?

Medo

7 Answers

51

If you read the docs for OneHotEncoder you'll see the expected input for fit is an "Input array of type int", so you need two steps to one-hot encode your data:

from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.LabelEncoder()
enc.fit(cat_features)
new_cat_features = enc.transform(cat_features)
print(new_cat_features)  # [1 2 0]
new_cat_features = new_cat_features.reshape(-1, 1)  # Needs to be the correct shape
ohe = preprocessing.OneHotEncoder(sparse=False)  # Easier to read
print(ohe.fit_transform(new_cat_features))

Output:

[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]
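
To apply this to the actual dataset rather than to the list of column names, fit a LabelEncoder per column. A minimal sketch, assuming a hypothetical dataframe with the question's columns (the values here are made up):

import pandas as pd
from sklearn import preprocessing

# hypothetical stand-in for the question's dataset
dataset = pd.DataFrame({'color': ['Color', 'Black and White', 'Color'],
                        'director_name': ['James Cameron', 'Doug Walker', 'James Cameron'],
                        'actor_2_name': ['Joel David Moore', 'Rob Walker', 'Jack Black']})

cat_features = ['color', 'director_name', 'actor_2_name']
for col in cat_features:
    # LabelEncoder handles one column at a time: strings -> integer codes
    dataset[col] = preprocessing.LabelEncoder().fit_transform(dataset[col])

# integer codes -> one-hot columns
ohe = preprocessing.OneHotEncoder(sparse=False)
print(ohe.fit_transform(dataset[cat_features].values))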

EDIT

As of 0.20 this became a bit easier, not only because OneHotEncoder now handles strings nicely, but also because we can transform multiple columns easily using ColumnTransformer; see below for an example.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

X = np.array([['apple', 'red', 1, 'round', 0],
              ['orange', 'orange', 2, 'round', 0.1],
              ['banana', 'yellow', 2, 'long', 0],
              ['apple', 'green', 1, 'round', 0.2]])
ct = ColumnTransformer(
    [('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3]),],  # the column numbers I want to apply this to
    remainder='passthrough'  # This leaves the rest of my columns in place
)
print(ct.fit_transform(X))  # Notice the output is all strings, since X is a string array

Output:

[['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
 ['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
 ['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
 ['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]
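
For completeness, a sketch of how this ColumnTransformer might slot into a Pipeline (the model and step names here are assumptions, not part of the original answer):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# hypothetical pipeline: reuses the ct defined above, then fits a model
pipe = Pipeline([
    ('encode', ct),
    ('model', LogisticRegression()),
])
# pipe.fit(X_train, y_train) would then encode and fit in one call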
piman314
  • I don't understand this answer at all. Where do you fit your encoders with data from the dataset? Could you please provide a more elaborate example with the dataset from the question? – Niakrais Apr 25 '18 at 13:41
  • How do you do this in a pipeline? – Dwagner Sep 20 '18 at 12:57
  • Honestly, the naming of the variables is confusing. cat_features is not the list of categorical features in a dataset; it is the dataset itself, with 1 column which is categorical. LabelEncoder encodes one categorical variable at a time. – ShikharDua Oct 09 '18 at 01:15
  • Regarding EDIT: Using a Pandas dataframe allows for mixed-type output. X = pd.DataFrame([['apple', 'red', 1, 'round', 0], ... with ct = ColumnTransformer([('oh_enc', OneHotEncoder(sparse=False), [0, 1])], ... produces mixed output: [[1.0 0.0 0.0 0.0 0.0 1.0 0.0 1 'round' 0.0]... – Sebastian Kropp Dec 10 '19 at 20:58
14

You can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

from sklearn.preprocessing import LabelBinarizer

cat_features = ['color', 'director_name', 'actor_2_name']
encoder = LabelBinarizer()
new_cat_features = encoder.fit_transform(cat_features)
new_cat_features

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
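
Note that, as written, the snippet above binarizes the three column names themselves. To encode an actual data column, fit the LabelBinarizer on that column; a minimal sketch (the dataframe is a made-up stand-in, not from the book):

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# hypothetical column with three distinct categories
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

encoder = LabelBinarizer()
print(encoder.fit_transform(df['color']))  # one one-hot column per distinct value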

Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow

Fallou Tall
7

If the dataset is in a pandas data frame, using

pandas.get_dummies

will be more straightforward.
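
For example, a minimal sketch with the question's column names (the dataframe contents are assumed):

import pandas as pd

# hypothetical dataframe: three categorical columns plus a numeric one
df = pd.DataFrame({'color': ['Color', 'Black and White'],
                   'director_name': ['James Cameron', 'Doug Walker'],
                   'actor_2_name': ['Joel David Moore', 'Rob Walker'],
                   'duration': [178, 84]})

# one-hot encode only the categorical columns; numeric columns pass through
print(pd.get_dummies(df, columns=['color', 'director_name', 'actor_2_name']))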


HappyCoding
5

From the documentation:

categorical_features : “all” or array of indices or mask
Specify what features are treated as categorical.
‘all’ (default): All features are treated as categorical.
array of indices: Array of categorical feature indices.
mask: Array of length n_features and with dtype=bool.

Column names of a pandas dataframe won't work. If your categorical features are column numbers 0, 2 and 6, use:

from sklearn import preprocessing
cat_features = [0, 2, 6]
enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
enc.fit(dataset.values)

It must also be noted that if these categorical features are not label encoded, you need to use LabelEncoder on these features before using OneHotEncoder (see the sketch below).
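
A sketch of that full flow on the older sklearn versions this answer targets (the data and column indices here are made up; columns 0 and 2 are the categorical ones):

import numpy as np
from sklearn import preprocessing

# hypothetical data: columns 0 and 2 hold strings, column 1 is numeric
X = np.array([['PG', 1.0, 'red'],
              ['R', 2.0, 'blue'],
              ['PG', 3.0, 'red']], dtype=object)

cat_features = [0, 2]
for i in cat_features:
    # label encode each categorical column in place
    X[:, i] = preprocessing.LabelEncoder().fit_transform(X[:, i])

enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
print(enc.fit_transform(X.astype(float)).toarray())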

Abhishek Thakur
2

A comment on @piman314's answer (not enough reputation to comment):

This problem only happens with sklearn versions <= 0.19. The 0.19 documentation for the fit method only allows integer input:

fit(X, y = None)

X: Input array of type int.

Later versions (see the documentation of 0.20) automatically deal with the input datatype and allow string input:

fit(X, y = None)

X: The data to determine the categories of each feature.
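
To illustrate the newer behaviour, a minimal sketch (assumes sklearn >= 0.20):

from sklearn.preprocessing import OneHotEncoder

# strings are accepted directly; no LabelEncoder step is needed
enc = OneHotEncoder(sparse=False)
print(enc.fit_transform([['PG'], ['R'], ['PG-13']]))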

zljt3216
1

@Medo,

I encountered the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.

Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is

X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).

This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
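
You can reproduce that failing check in isolation; a small sketch that raises the same error:

import numpy as np
from sklearn.utils import check_array

# the same coercion OneHotEncoder applies before any column selection;
# raises ValueError: could not convert string to float: 'PG'
check_array(np.array([['PG', '1.0']]), dtype=np.float64)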

I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.

Bahman Engheta
0

There is a simple fix if, like me, you get frustrated by this. Simply use Category Encoders' OneHotEncoder. This is a scikit-learn-contrib package, so it plays super nicely with the scikit-learn API.

This works as a direct replacement and does the boring label encoding for you.

from category_encoders import OneHotEncoder
cat_features = ['color', 'director_name', 'actor_2_name']
enc = OneHotEncoder(cols=cat_features)  # column selection uses cols, not categorical_features
enc.fit(dataset)  # fit on the dataframe so columns can be selected by name
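
A quick usage sketch with a hypothetical dataframe (category_encoders returns a dataframe by default):

import pandas as pd
from category_encoders import OneHotEncoder

# made-up data standing in for the question's dataset
df = pd.DataFrame({'color': ['Color', 'Black and White'],
                   'duration': [178, 84]})

enc = OneHotEncoder(cols=['color'], use_cat_names=True)
print(enc.fit_transform(df))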
Little Bobby Tables