Sklearn preprocessing label encoder is throwing error for multiple columns

Question

I have a pandas DataFrame with the following structure:

item_condition_id                     category
brand_name                            category
price                                  float64
shipping                              category
main_category                         category
category                              category
sub_category                          category
hashing_feature_aa                     float64
hashing_feature_ab                     float64

Example with portion of data:

brand_name  shipping  main_category        category
Target         1         Women           Tops & Blouses
unknown        1          Home           Home Décor
unknown        0         Women            Jewelry
unknown        0         Women             Other

I converted the categorical (string) columns to numerical values using the code below.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in range(len(X)):
    X.iloc[:,i] = le.fit_transform(X.iloc[:,i])

After Conversion

   brand_name  shipping  main_category  category
        0         1              1         3
        1         1              0         0
        1         0              1         1
        1         0              1         2

This works as expected, but when I try to apply inverse_transform to recover the original categories from the numerical codes, it throws an error.

for i in range(len(X)):
    X.iloc[:,i] = le.inverse_transform(X.iloc[:,i])

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

How can I resolve this error? What is wrong with my code?

My goal is to convert the categorical (string) features to numerical values using LabelEncoder so that I can apply sklearn.feature_selection.SelectKBest.fit_transform(X, y); without label encoding, that step fails.

Thanks

Without a minimal example with data it is hard to reproduce this issue. Just one thought: are you using the same instance of le to call fit_transform() and inverse_transform()? Because if you iterate over the columns and fit one LabelEncoder() le per column, you have to use the same instances of le to call inverse_transform() (e.g. by storing them in a dict) — Marcus V., Nov 28 '17 at 11:10
Yes, as @MarcusV. has pointed out, the LabelEncoder instance is overwritten on each iteration, which produces the issue. It would be good to have a simple example. — Vivek Harikrishnan, Nov 28 '17 at 11:15
Hi guys, I added a small example. Yes, I'm using the same instance of LabelEncoder() le for both fit_transform and inverse_transform. Do I have to create separate instances of LabelEncoder for each column, like le_brand_name, le_category, etc.? — Siva Naidu, Nov 28 '17 at 11:40
Yes, you would have to do this (e.g. store them in a dict). Or check out [this](https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn/31939145#31939145) elegant approach to this problem. — Marcus V., Nov 28 '17 at 11:42
For future, also check out [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) regarding good minimal examples. — Marcus V., Nov 28 '17 at 11:42
I now added it as an answer as well. So if it answered your question well enough feel free to accept :) — Marcus V., Nov 28 '17 at 11:50
Possible duplicate of [Decode pandas dataframe](https://stackoverflow.com/questions/47217821/decode-pandas-dataframe) — Vivek Kumar, Nov 28 '17 at 13:04

score 1 · Accepted Answer · answered Nov 28 '17 at 11:48

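Based on your clarification: your problem is that the single instance le is refitted on every pass through the loop, so after the loop it only knows the classes of the last column, and inverse_transform then fails on the other columns. Based on your code I would suggest keeping one encoder per column in a dict, e.g. as follows: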

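from sklearn.preprocessing import LabelEncoder

le = {}  # one fitted encoder per column position
for i in range(X.shape[1]):  # iterate over columns; len(X) would be the row count
    le[i] = LabelEncoder()
    X.iloc[:,i] = le[i].fit_transform(X.iloc[:,i])
# do stuff
for i in range(X.shape[1]):
    # decode each column with the encoder that was fitted on it
    X.iloc[:,i] = le[i].inverse_transform(X.iloc[:,i])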

But as noted in the comments above, also look at the linked answer on label encoding across multiple columns.
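
For reference, here is a self-contained sketch of the same idea, assuming only the four sample rows shown in the question. It keys the encoders by column name rather than by position and writes the results into new frames instead of overwriting X. The names encoders, encoded and decoded are chosen here for illustration and are not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample rows copied from the question (assumed to be representative).
X = pd.DataFrame({
    "brand_name": ["Target", "unknown", "unknown", "unknown"],
    "shipping": [1, 1, 0, 0],
    "main_category": ["Women", "Home", "Women", "Women"],
    "category": ["Tops & Blouses", "Home Décor", "Jewelry", "Other"],
})

encoders = {}                          # one fitted LabelEncoder per column name
encoded = pd.DataFrame(index=X.index)  # holds the integer codes
for col in X.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(X[col])

# ... run feature selection or other steps on `encoded` here ...

decoded = pd.DataFrame(index=X.index)  # holds the recovered original values
for col in encoded.columns:
    # Each column is decoded with the encoder that was fitted on it.
    decoded[col] = encoders[col].inverse_transform(encoded[col])

print(encoded)
print(decoded)
```

Keying by column name keeps the mapping valid even if the columns are later reordered or a subset is selected, which a positional index would not guarantee.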
