Apply LabelEncoder to multiple columns in train and test datasets

Question

I have a dataset in which several columns hold string values. I need to convert these text columns to numeric values using LabelEncoder. In the example below, y is the target of my train dataset and A0 to A13 are different features. There are 50 more features, but I have provided only a subset here. How do I apply LabelEncoder to columns A0 to A8 together and create a new encoded dataframe for building the model? I know we can do something like the following, but it encodes only one column. I want the encoder to be applied to all columns from A0 to A8 and then feed the data to the model. How can I do that?

    from sklearn.preprocessing import LabelEncoder
    gender_encoder = LabelEncoder()
    y = gender_encoder.fit_transform(y)

Sample data below

           y       A0 A1  A2 A3 A4  A5 A6 A8  A10  A12  A13
    0     130.81   k  v  at  a  d   u  j  o    0    0    1
    1      88.53   k  t  av  e  d   y  l  o    0    0    0
    2      76.26  az  w   n  c  d   A  j  A    0    0    0
    3      80.62  az  t   n  f  d   A  l  e    0    0    0
    4      78.02  az  v   n  f  d   h  d  n    0    0    0

score 0 · Answer 1 · answered Jul 31 '20 at 08:09

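You could use ColumnTransformer from sklearn:

    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import OneHotEncoder

    # List each categorical column explicitly; a string such as "A0:A8" is not a valid selector.
    cat_cols = ["A0", "A1", "A2", "A3", "A4", "A5", "A6", "A8"]
    col_trans = make_column_transformer((OneHotEncoder(), cat_cols), remainder="passthrough")

List the individual columns inside the [ ]. The remainder argument specifies what to do with the columns that are not listed; "passthrough" keeps them unchanged.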
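To reuse the same encoding on the test set, fit the transformer on the training data and only call transform on the test data. The following is a minimal sketch under a few assumptions: the toy frames `train` and `test` are illustrative stand-ins for the real data, `handle_unknown="ignore"` is added so that categories unseen during fitting become all-zero rows instead of raising an error, and `sparse_threshold=0` is set only to get a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy frames modelled on the sample data in the question (values are illustrative).
train = pd.DataFrame({
    "A0": ["k", "k", "az", "az", "az"],
    "A1": ["v", "t", "w", "t", "v"],
    "A10": [0, 0, 0, 0, 0],
})
test = pd.DataFrame({
    "A0": ["k", "zz"],  # "zz" never appears in the training data
    "A1": ["t", "v"],
    "A10": [1, 0],
})

cat_cols = ["A0", "A1"]

col_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), cat_cols),
    remainder="passthrough",  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense NumPy array
)

# Learn the categories from the training data only.
X_train = col_trans.fit_transform(train)
# Apply the already fitted encoding to the test data.
X_test = col_trans.transform(test)
```

Both arrays end up with the same number of columns, so a model trained on `X_train` can be applied to `X_test` directly.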

score 0 · Answer 2 · answered Jul 31 '20 at 08:10

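You can use ColumnTransformer and Pipeline to encode all categorical columns. Afterwards you can also add a transformation for the numerical columns.

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OrdinalEncoder

    categorical_features = ['A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A8']
    # OrdinalEncoder is used here because LabelEncoder only accepts a 1-D target
    # and fails inside a Pipeline/ColumnTransformer.
    categorical_transformer = Pipeline(steps=[('enc', OrdinalEncoder())])

    preprocessor = ColumnTransformer(transformers=[('cat',
                                                     categorical_transformer,
                                                     categorical_features)])
    pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

    # Fit on the training data only, then reuse the fitted pipeline on the test data.
    pipeline.fit(X_train)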
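To get an encoded dataframe for both train and test, one option is to fit a single encoder on the training columns and reuse it on the test columns. The sketch below is not taken from the answer above; it assumes scikit-learn 0.24 or later and uses OrdinalEncoder, which accepts several feature columns at once (LabelEncoder is designed for a single 1-D target). The toy frames and the choice of -1 as the code for unseen categories are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames modelled on the sample data in the question (values are illustrative).
train = pd.DataFrame({"A0": ["k", "k", "az"], "A1": ["v", "t", "w"], "A10": [0, 0, 1]})
test = pd.DataFrame({"A0": ["az", "zz"], "A1": ["t", "v"], "A10": [1, 0]})

cat_cols = ["A0", "A1"]

# Categories that were not seen during fit are mapped to -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_enc = train.copy()
test_enc = test.copy()

# Learn the category-to-integer mapping from the training data only.
train_enc[cat_cols] = enc.fit_transform(train[cat_cols])
# Reuse the same mapping for the test data.
test_enc[cat_cols] = enc.transform(test[cat_cols])
```

The fitted mapping for each column is available in `enc.categories_`, which helps when checking that train and test were encoded consistently.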
