
Below is a piece of my code for label encoding. When I applied the LabelEncoder to one column of the DataFrame at a time, it worked fine, but when I tried to apply it to all of the categorical features at once, sklearn throws:

ValueError: bad input shape (600000, 24). I'm not able to find any specific reason for that.

import pandas as pd
from sklearn import preprocessing

df = pd.read_csv("../inputs/cat-in-the-dat-train-folds.csv")

# extracting the categorical features
cat_features = [x for x in df.columns if x not in ("id", "target", "kfold")]

# treat every categorical column as strings and mark missing values
for col in cat_features:
    df.loc[:, col] = df[col].astype(str).fillna("NONE")

# `fold` holds the current fold number (defined elsewhere in the script)
df_train = df[df["kfold"] != fold].reset_index(drop=True)
df_valid = df[df["kfold"] == fold].reset_index(drop=True)

lbl_enc = preprocessing.LabelEncoder()
full_cat_data = pd.concat(
    [df_train[cat_features], df_valid[cat_features]],
    axis=0)
lbl_enc.fit(full_cat_data)
x_train = lbl_enc.transform(df_train[cat_features])
x_valid = lbl_enc.transform(df_valid[cat_features])
1 Answer


sklearn.preprocessing.LabelEncoder.fit only takes a 1D array as a parameter.

To fit multiple columns, use sklearn.preprocessing.OrdinalEncoder.fit, which accepts multi-dimensional data of shape (n_samples, n_features) (as per the documentation).
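A rough illustration of that shape requirement, using toy data rather than your dataset:

import pandas as pd
from sklearn import preprocessing

toy = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size":  ["S", "M", "S"],
})

le = preprocessing.LabelEncoder()
le.fit(toy["color"])          # fine: a single 1D column

oe = preprocessing.OrdinalEncoder()
oe.fit(toy)                   # fine: 2D input of shape (n_samples, n_features)
encoded = oe.transform(toy)   # numpy array of shape (3, 2)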

In your example, try replacing lbl_enc = preprocessing.LabelEncoder() with lbl_enc = preprocessing.OrdinalEncoder() and that should work.
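A minimal sketch of that swap, reusing the variable names from your snippet:

lbl_enc = preprocessing.OrdinalEncoder()
full_cat_data = pd.concat(
    [df_train[cat_features], df_valid[cat_features]],
    axis=0)
lbl_enc.fit(full_cat_data)                           # fits all 24 columns at once
x_train = lbl_enc.transform(df_train[cat_features])
x_valid = lbl_enc.transform(df_valid[cat_features])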

See this answer for more information on the difference between LabelEncoder and OrdinalEncoder.
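If you do want to stay with LabelEncoder, the usual workaround is to fit a separate encoder per column, which is essentially what already worked for you one column at a time. A sketch, assuming the same df_train, df_valid, and full_cat_data as above:

label_encoders = {}
for col in cat_features:
    lbl = preprocessing.LabelEncoder()
    lbl.fit(full_cat_data[col])                       # fit on train + valid values of this column
    df_train.loc[:, col] = lbl.transform(df_train[col])
    df_valid.loc[:, col] = lbl.transform(df_valid[col])
    label_encoders[col] = lbl                         # keep for inverse_transform later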
