I have a dataset with 20+ columns, each containing categorical data. How do I encode them using sklearn in Python? I tried LabelBinarizer, LabelEncoder, and OneHotEncoder, but none of them work.

One of the errors:

ValueError: Multioutput target data is not supported with label binarization

I am using a Kaggle dataset:

import pandas as pd

datasets = pd.read_csv('mushrooms.csv')
x = datasets.iloc[:, 1:23].values
y = datasets.iloc[:,0].values

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
datasets_cat_hot = encoder.fit_transform(x_train)
Kda
  • same question here https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn – seralouk Nov 21 '17 at 09:08

1 Answer


LabelBinarizer and LabelEncoder are designed for a single column of labels, so they cannot be applied to a 2-D array with multiple columns at once; that is what the ValueError is reporting. You can, however, use the apply method of a pandas DataFrame to encode each column separately. Here is a complete solution:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.read_csv('mushrooms.csv')
X_df = df.iloc[:, 1:]
y_df = df.iloc[:, 0]

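# Integer-encode each feature column; the encoder is refit on every column.
X_df = X_df.apply(LabelEncoder().fit_transform)

# One-hot encode the integer codes into a dense array.
# On scikit-learn < 1.2 the parameter is named sparse instead of sparse_output.
X = OneHotEncoder(sparse_output=False).fit_transform(X_df.values)
# Encode the target labels as integers.
y = LabelEncoder().fit_transform(y_df.values)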
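
As a possible simplification, assuming scikit-learn 0.20 or later: OneHotEncoder accepts string columns directly, so the per-column LabelEncoder step can be skipped. The sketch below uses a small made-up DataFrame (the column names and values are hypothetical stand-ins for the mushroom data) and converts the default sparse result with toarray(), which avoids the sparse/sparse_output parameter difference between versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for two of the mushroom feature columns.
X_df = pd.DataFrame({
    "cap_shape": ["x", "b", "x", "f"],
    "odor": ["p", "a", "a", "n"],
})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a SciPy sparse matrix by default; densify it here.
X = encoder.fit_transform(X_df).toarray()
print(X.shape)  # (4, 6): three categories per column

# Reuse the fitted encoder on new rows (for example, the test split).
X_new_df = pd.DataFrame({"cap_shape": ["k"], "odor": ["a"]})
X_new = encoder.transform(X_new_df).toarray()
print(X_new)  # "k" was never seen, so its three cap_shape slots stay zero
```

Fitting the encoder on the training split only and calling transform on the test split keeps the column layout identical between the two.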
constt