Recoding Categorical Variables in Python

Question

I've been trying to learn Python 3.6 using the Anaconda distribution. I've hit a snag with the content of the online course I'm using, and could use some help working through some error messages. I'd ask the instructors of the course, but they don't seem very responsive to questions from students.

I've been having some trouble working with the three main classes used to recode categorical data. As I understand it, there are three classes in the scikit-learn package for recoding variables: LabelEncoder, OneHotEncoder and LabelBinarizer. I have attempted to use each of them to recode a categorical variable inside a dataset, but keep getting an error every time.

Please pardon my relative noobness in the sample code. As one might have guessed from how basic my question is, I am not well versed in Python.

The object X contains a few columns, the first being a categorical string I need to convert (If someone could also tell me how to insert tables, that'd be helpful. Do I have to use HTML?):

"Fish" 1 5 3
"Dog" 2 6 9
"Dog" 8 8 8
"Cat" 5 7 6
"Cat" 6 6 6

Label Encoder Attempt

Below is the code I attempted to implement, and the resulting error message I received for the object X, which has roughly the properties I described above.

from sklearn.preprocessing import LabelEncoder
labelencoder_X =LabelEncoder 
X[:, 0] = LabelEncoder.fit_transform(X[:, 0])

TypeError: fit_transform() missing 1 required positional argument: 'y'

What is throwing me is I thought the above code was clearly defining what y is, the first column of X.

OneHotEncoder

from sklearn.preprocessing import OneHotEncoder 
onehotencoder = OneHotEncoder(categorical_features=[0]) 
X = onehotencoder.fit_transform[X].toarray()

TypeError: 'method' object is not subscriptable

Label Binarizer

I've found this one the hardest to understand, and actually couldn't make an attempt based on the structure of the dataset.

Any guidance or suggestions you could provide would be endlessly helpful.

score 3 · Accepted Answer · answered Apr 06 '18 at 04:48

Let's take it step by step.

First, load the data you showed into a NumPy array named X:

import numpy as np
X = np.array([["Fish", 1, 5, 3],
              ["Dog",  2, 6, 9],
              ["Dog",  8, 8, 8],
              ["Cat",  5, 7, 6],
              ["Cat",  6, 6, 6]])

Now try your code.
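One detail worth knowing before running the encoders: because the rows mix strings and integers, NumPy stores every entry of X as a string. The short check below is only a sketch to illustrate that; the variable name first_number is mine, not from the question.

```python
import numpy as np

# Same data as above: strings and integers in one array.
X = np.array([["Fish", 1, 5, 3],
              ["Dog",  2, 6, 9],
              ["Dog",  8, 8, 8],
              ["Cat",  5, 7, 6],
              ["Cat",  6, 6, 6]])

# NumPy picks a single dtype for the whole array; here it is a string type.
print(X.dtype.kind)  # 'U' (unicode string) on Python 3

# The numeric columns were therefore converted to text as well.
first_number = X[0, 1]
print(first_number == "1", first_number == 1)
```

This is why the numbers appear in quotes in the LabelEncoder output further down.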

1) LabelEncoder

from sklearn.preprocessing import LabelEncoder
labelencoder_X =LabelEncoder 
X[:, 0] = LabelEncoder.fit_transform(X[:, 0])

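The mistake here is that you are using the class LabelEncoder itself as if it were an object. Writing labelencoder_X = LabelEncoder (without parentheses) only gives the class a second name, and LabelEncoder.fit_transform(...) calls the method directly on the class. Python then treats X[:, 0] as self, so the real argument y is reported as missing. Correct that with: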

from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

See the changes in lines 2 and 3 above. First I created an object labelencoder_X of the LabelEncoder class by calling LabelEncoder(), and then I used that object to call fit_transform() via labelencoder_X.fit_transform(). This code no longer raises an error, and the new X is:

Output:
array([['2', '1', '5', '3'],
       ['1', '2', '6', '9'],
       ['1', '8', '8', '8'],
       ['0', '5', '7', '6'],
       ['0', '6', '6', '6']], dtype='|S4')

See that the first column has been changed successfully: the labels are numbered in sorted order (Cat becomes 0, Dog becomes 1, Fish becomes 2).
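If the class-versus-object distinction is still fuzzy, here is a minimal sketch that reproduces the same TypeError without scikit-learn. ToyEncoder is a made-up class for illustration only; its method body is a placeholder and does no real encoding.

```python
class ToyEncoder:
    """Hypothetical stand-in for an encoder; not part of scikit-learn."""

    def fit_transform(self, y):
        # Placeholder body: what the method does is irrelevant to the error.
        return list(y)


labels = ["Fish", "Dog", "Dog", "Cat", "Cat"]

# Called on the class: `labels` is bound to `self`, so `y` is missing.
try:
    ToyEncoder.fit_transform(labels)
except TypeError as err:
    class_call_error = str(err)
    print(class_call_error)

# Called on an instance: `self` is filled in automatically, `labels` becomes `y`.
encoder = ToyEncoder()
result = encoder.fit_transform(labels)
print(result)
```

The first call fails with the same "missing 1 required positional argument: 'y'" message as in the question; the second one works because an instance exists to play the role of self.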

2) OneHotEncoder

Your code:

from sklearn.preprocessing import OneHotEncoder 
onehotencoder = OneHotEncoder(categorical_features=[0]) 
X = onehotencoder.fit_transform[X].toarray()

Here you are not making the mistake you made with LabelEncoder: you correctly create the object by calling OneHotEncoder(...). The mistake is in fit_transform[X]. fit_transform is a method, so it has to be called with round parentheses, like this: fit_transform(X). Square brackets ask Python to index into the method object itself, which is what "'method' object is not subscriptable" means.

See this question for more details about the error.
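The same error can be reproduced in a few lines of plain Python. This is just an illustrative sketch; the Toy class and its method are invented for the example.

```python
class Toy:
    """Hypothetical class used only to trigger the error."""

    def fit_transform(self, values):
        return [2 * v for v in values]


toy = Toy()

# Square brackets try to index the method object itself, which is not allowed.
try:
    toy.fit_transform[[1, 2, 3]]
except TypeError as err:
    subscript_error = str(err)
    print(subscript_error)

# Round parentheses actually call the method.
doubled = toy.fit_transform([1, 2, 3])
print(doubled)  # [2, 4, 6]
```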

The correct code should be:

from sklearn.preprocessing import OneHotEncoder 
onehotencoder = OneHotEncoder(categorical_features=[0]) 
X = onehotencoder.fit_transform(X).toarray()

Output: 
array([[0., 0., 1., 1., 5., 3.],
       [0., 1., 0., 2., 6., 9.],
       [0., 1., 0., 8., 8., 8.],
       [1., 0., 0., 5., 7., 6.],
       [1., 0., 0., 6., 6., 6.]])

Note: The above code should be called on an X that has already been transformed with LabelEncoder. If you use it on the original X, it will still throw an error.
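A caveat for newer scikit-learn releases: the categorical_features argument was deprecated in version 0.20 and removed in 0.22, so the code above only runs on older installations. A rough equivalent, assuming scikit-learn 0.20 or later, is to wrap OneHotEncoder in a ColumnTransformer. In this sketch X is rebuilt with dtype=object so the numeric columns stay numbers; the step name "animal" and the variable names are my own choices, and this is one possible approach rather than the only one.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# dtype=object keeps the strings as strings and the integers as integers.
X_raw = np.array([["Fish", 1, 5, 3],
                  ["Dog",  2, 6, 9],
                  ["Dog",  8, 8, 8],
                  ["Cat",  5, 7, 6],
                  ["Cat",  6, 6, 6]], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged.
# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
column_transformer = ColumnTransformer(
    [("animal", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)

# The stacked result has object dtype, so convert it to floats at the end.
X_encoded = column_transformer.fit_transform(X_raw).astype(float)
print(X_encoded)
```

With this route no separate LabelEncoder step is needed, since OneHotEncoder in these versions accepts string categories directly. The result should have the same six columns as the output above.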

3) LabelBinarizer

This works much like LabelEncoder, except that it also one-hot encodes the supplied column in the same step.

from sklearn.preprocessing import LabelBinarizer
labelbinarizer_X = LabelBinarizer()
new_binarized_val = labelbinarizer_X.fit_transform(X[:, 0])

Output:
array([[0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0]])

Note: I ran the LabelBinarizer code on the original X from your question, not the already encoded one, and the output shows only the binarized form of the first column.
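To make the relationship between the three encoders concrete, here is a plain-Python sketch of the two steps involved. It is not how scikit-learn implements them, only an illustration of what the results mean; all names are mine.

```python
labels = ["Fish", "Dog", "Dog", "Cat", "Cat"]

# Step 1 (the kind of result LabelEncoder gives): number the sorted distinct labels.
classes = sorted(set(labels))  # ['Cat', 'Dog', 'Fish']
codes = [classes.index(label) for label in labels]

# Step 2 (what one-hot encoding adds): one indicator column per class.
one_hot = [[1 if code == k else 0 for k in range(len(classes))]
           for code in codes]

print(codes)  # [2, 1, 1, 0, 0]
for row in one_hot:
    print(row)
```

LabelBinarizer goes from the labels straight to the indicator rows, which is why its output matches the first three columns of the OneHotEncoder output.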

Hope this makes things clear.

Ahhhh, now everything makes sense. I didn't realize that LabelEncoder was chaining into OneHotEncoder; I thought they were separate processes trying to do the same thing. I'd found one site that suggested you needed to use parentheses for referencing X in OneHotEncoder, but that hadn't worked. I now understand that I was still getting an error because of the mistakes in my LabelEncoder code. Thanks so much! - Fledgling Analyst, Apr 06 '18 at 13:22
