I have the code below, where I am trying to use a one-hot encoder, but I get the error `ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().`

 from sklearn.preprocessing import LabelEncoder, OneHotEncoder
 import pandas as pd

 target=train_features_df['y']
 train_features_df=train_features_df.drop(['y'], axis=1)

 # Boolean mask that marks the categorical (object-dtype) feature columns
 categorical_feature_mask = train_features_df.dtypes==object
 # filter categorical columns using mask and turn it into a list
 categorical_cols = train_features_df.columns[categorical_feature_mask].tolist()

 # instantiate labelencoder object
 le = LabelEncoder()
 # apply le on categorical feature columns
 train_features_df[categorical_cols] = train_features_df[categorical_cols].apply(lambda col: le.fit_transform(col))
 train_features_df[categorical_cols].head(10)

 # instantiate OneHotEncoder
 ohe = OneHotEncoder(categories = categorical_feature_mask, sparse=False ) 
 # categorical_features = boolean mask for categorical columns
 # sparse = False output an array not sparse matrix

 # apply OneHotEncoder on categorical feature columns
 ohe.fit_transform(train_features_df)

I get this error on the last line, `ohe.fit_transform(train_features_df)`: "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

The full traceback, as requested, is below:

   ---------------------------------------------------------------------------
   ValueError                                Traceback (most recent call last)
   <ipython-input-12-72e45bd93f15> in <module>
        23 
        24 # apply OneHotEncoder on categorical feature columns
   ---> 25 ohe.fit_transform(train_features_df)
        26 #train_encoded_df=pd.DataFrame(data = ohe.fit_transform(train_features_df)) # It returns an numpy array
   
   ~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in fit_transform(self, X, y)
       408         """
       409         self._validate_keywords()
   --> 410         return super().fit_transform(X, y)
       411 
       412     def transform(self, X):
   
   ~\Anaconda3\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
       688         if y is None:
       689             # fit method of arity 1 (unsupervised transformation)
   --> 690             return self.fit(X, **fit_params).transform(X)
       691         else:
       692             # fit method of arity 2 (supervised transformation)
   
   ~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in fit(self, X, y)
       383         """
       384         self._validate_keywords()
   --> 385         self._fit(X, handle_unknown=self.handle_unknown)
       386         self.drop_idx_ = self._compute_drop_idx()
       387         return self
   
   ~\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in _fit(self, X, handle_unknown)
        74         X_list, n_samples, n_features = self._check_X(X)
        75 
   ---> 76         if self.categories != 'auto':
        77             if len(self.categories) != n_features:
        78                 raise ValueError("Shape mismatch: if categories is an array,"
   
   ~\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
      1477     def __nonzero__(self):
      1478         raise ValueError(
   -> 1479             f"The truth value of a {type(self).__name__} is ambiguous. "
      1480             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
      1481         )
   
   ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Invictus

1 Answer

Invictus,

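The error occurs because you are passing the boolean mask (a pandas Series) as the `categories` parameter, which is not what the encoder expects. As the traceback shows, the encoder evaluates `self.categories != 'auto'` inside an `if`; with a Series this comparison produces another Series, whose truth value is ambiguous. If you want to encode just the categorical columns, select them first and pass the selection to the encoder: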

ohe = OneHotEncoder(categories = 'auto', sparse=False ) 
selection = train_features_df[train_features_df.columns[categorical_feature_mask]]
encoded = ohe.fit_transform(selection)

Then merge the encoded result with the non-categorical columns.
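
The merge step could look roughly like the sketch below. The toy frame and its column names are hypothetical stand-ins for `train_features_df`, and the column-naming scheme is only one possible choice. The sketch calls `.toarray()` on the default sparse output rather than passing `sparse=False`, on the assumption that this keyword may not be available under that name in every scikit-learn release.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for train_features_df
train_features_df = pd.DataFrame({
    "X0": pd.Series(["a", "b", "a", "c"], dtype=object),
    "X1": pd.Series(["u", "u", "v", "v"], dtype=object),
    "X10": [0, 1, 0, 1],
})

# Same mask-based selection as in the question
categorical_feature_mask = train_features_df.dtypes == object
categorical_cols = train_features_df.columns[categorical_feature_mask].tolist()

# Encode only the categorical columns; the default output is a sparse matrix
ohe = OneHotEncoder(categories="auto")
encoded = ohe.fit_transform(train_features_df[categorical_cols]).toarray()

# Build readable column names from the categories learned for each column
encoded_cols = [
    f"{col}_{cat}"
    for col, cats in zip(categorical_cols, ohe.categories_)
    for cat in cats
]
encoded_df = pd.DataFrame(encoded, columns=encoded_cols,
                          index=train_features_df.index)

# Merge the encoded columns with the untouched non-categorical columns
result = pd.concat(
    [train_features_df.drop(columns=categorical_cols), encoded_df], axis=1
)
print(result.head())
```

Reusing the original index when building `encoded_df` keeps the rows aligned during `pd.concat`.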

If you want to use the `categories` parameter to pass the category values explicitly, use the example from here.
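
For illustration only (this is not the linked example), here is a minimal sketch of passing explicit category lists, one list per column. The data and the `known_categories` name are hypothetical. Because the lists are fixed up front, the train and test encodings share the same column layout even when a value is absent from one of the sets.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the test set does not contain the value "c"
train = pd.DataFrame({
    "X0": pd.Series(["a", "b", "c"], dtype=object),
    "X1": pd.Series(["u", "v", "u"], dtype=object),
})
test = pd.DataFrame({
    "X0": pd.Series(["b", "a"], dtype=object),
    "X1": pd.Series(["v", "v"], dtype=object),
})

# One list of allowed values per column, in column order
known_categories = [["a", "b", "c"], ["u", "v"]]

ohe = OneHotEncoder(categories=known_categories)
train_encoded = ohe.fit_transform(train).toarray()
test_encoded = ohe.transform(test).toarray()

print(train_encoded.shape, test_encoded.shape)
```

Note that `categories` takes the category values themselves, not a boolean mask of which columns are categorical.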

A more elegant option would be to use the pandas function for one-hot encoding:

pd.get_dummies(data=train_features_df, columns=train_features_df.columns[categorical_feature_mask])
Poe Dator
  • If I use get_dummies and have separate dataframes for train and test, how do I make sure that the encoding applied to train_features_df is applied identically to test_features_df? Is there a way to guarantee that? – Invictus Aug 01 '20 at 04:18
  • Typically, encoding is done before the train-test split. If that is not possible, use `OneHotEncoder` and pass the same lists of category values every time. BTW, there is no need for the label encoder here. It will destroy the consistency between the test and train sets. – Poe Dator Aug 01 '20 at 04:24
  • OK. For me the label column is missing in test, as expected, so I should take everything from the train data except the label column, merge it with the test data, apply ohe, then separate the two again and add the label column back to the train dataset. What do you think about that? – Invictus Aug 01 '20 at 04:28
  • It may work. But my preference would be to stack the dataframes, apply pd.get_dummies, and then split train and test. Note that ohe will give you a numpy.array, so you'd need to carefully add labels to the resulting encoded columns. – Poe Dator Aug 01 '20 at 04:37
  • The shape of concat_train_test is (8418, 364), and of the 364 columns only 8 are categorical. When I apply `ohe = OneHotEncoder(categories = 'auto', sparse=False ) selection = train_features_df[train_features_df.columns[categorical_feature_mask]] encoded = ohe.fit_transform(selection)` I see that the encoded shape is (8418, 211), which looks like the total number of unique values in these 8 columns? When merging with the non-categorical data, do I need to name all these columns before applying PCA and XGBoost? – Invictus Aug 01 '20 at 05:49
  • Apparently yes. See how pd.get_dummies names such variables and do the same. Please consider accepting this answer and opening a new question for the complexities you encounter with `ohe`. My advice is to stack train+test and use pd.get_dummies. – Poe Dator Aug 01 '20 at 05:55
  • Great, using pd.get_dummies I see the final shape is the same as what we would have got after merging the output of `ohe` with all the non-categorical columns. Thanks for your help. – Invictus Aug 01 '20 at 05:58
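
The stack, encode, and split approach suggested in the comments could be sketched as follows. The two small frames are hypothetical stand-ins for `train_features_df` and `test_features_df`, and using `keys=` in `pd.concat` is just one possible way to remember which rows came from which set.

```python
import pandas as pd

# Hypothetical toy data; "c" appears only in the test set
train_features_df = pd.DataFrame({
    "X0": pd.Series(["a", "b", "a"], dtype=object),
    "X10": [0, 1, 0],
})
test_features_df = pd.DataFrame({
    "X0": pd.Series(["c", "a"], dtype=object),
    "X10": [1, 1],
})

# Stack both sets, tagging each row with its origin in the outer index level
stacked = pd.concat([train_features_df, test_features_df],
                    keys=["train", "test"])

# One-hot encode the categorical columns of the combined frame
categorical_cols = stacked.columns[stacked.dtypes == object].tolist()
encoded = pd.get_dummies(stacked, columns=categorical_cols)

# Split back; both parts now share exactly the same columns
train_encoded = encoded.loc["train"]
test_encoded = encoded.loc["test"]

print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

Because the encoding sees every value from both sets at once, a category that occurs only in test (here "c") still gets a column in the train part, filled with zeros.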