How to encode 20+ columns with categorical data using sklearn in python

Question

I have a dataset with 20+ columns, each containing categorical data. How do I encode them using sklearn in Python? I tried LabelBinarizer, LabelEncoder, and OneHotEncoder, but none of them worked.

One of the errors:

ValueError: Multioutput target data is not supported with label binarization

I am using a Kaggle dataset:

import pandas as pd

datasets = pd.read_csv('mushrooms.csv')
x = datasets.iloc[:, 1:23].values
y = datasets.iloc[:,0].values

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
datasets_cat_hot = encoder.fit_transform(x_train)

Same question here: https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn (seralouk, Nov 21 '17 at 09:08)

constt · Answer 1 · 2017-11-21T03:37:31.877

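LabelBinarizer and LabelEncoder are designed for a single 1-D column of labels, so they cannot be applied to a 2-D array with multiple columns; passing one is what raises the ValueError. You can, however, use the apply method of a pandas DataFrame to label-encode each column separately and then one-hot encode the result. Here is a complete solution: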

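import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.read_csv('mushrooms.csv')
X_df = df.iloc[:, 1:]  # feature columns
y_df = df.iloc[:, 0]   # target column

# Label-encode each column independently (the encoder is refit on every column)
X_df = X_df.apply(LabelEncoder().fit_transform)

# One-hot encode the integer codes; scikit-learn >= 1.2 uses sparse_output
# (on older versions, pass sparse=False instead)
X = OneHotEncoder(sparse_output=False).fit_transform(X_df.values)
y = LabelEncoder().fit_transform(y_df.values)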
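
As a possible variation on the solution above: since scikit-learn 0.20, OneHotEncoder accepts string columns directly, so the per-column LabelEncoder step can be skipped. The sketch below also fits the encoder on the training split only and reuses it on the test split, which keeps the column layout consistent between the two. The small DataFrame is a hypothetical stand-in for mushrooms.csv (its column names and values are assumptions made for illustration), and handle_unknown="ignore" is one choice for dealing with categories that appear only in the test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical stand-in for mushrooms.csv: the first column is the class label,
# the remaining columns are single-letter categorical features.
df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e", "p", "e", "e", "p", "e"],
    "cap-shape": ["x", "x", "b", "x", "x", "b", "b", "x", "x", "b"],
    "cap-color": ["n", "y", "w", "w", "g", "n", "y", "w", "g", "n"],
})

X_df = df.iloc[:, 1:]
# LabelEncoder is intended for the 1-D target column
y = LabelEncoder().fit_transform(df.iloc[:, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.2, random_state=0
)

# Fit on the training features only; categories unseen during fit are
# encoded as all-zero blocks at transform time instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_enc = encoder.fit_transform(X_train).toarray()
X_test_enc = encoder.transform(X_test).toarray()

print(X_train_enc.shape, X_test_enc.shape)
```

Calling .toarray() on the sparse result avoids depending on the sparse / sparse_output keyword, whose name differs between scikit-learn versions.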
