I have two CSV files, one containing training data and one containing test data, and I created two DataFrames from them: train_features_df and test_features_df. Both contain multiple categorical columns, so I need to apply LabelEncoder to them, as it suits my dataset. I applied the label encoder separately to the train and test data. When I print the encoded values, I see that the same categorical value of the same feature is encoded differently in the two datasets. Does that mean I have to merge the train and test data, apply label encoding, and then separate them again?

 from sklearn.preprocessing import LabelEncoder
 target=train_features_df['y']
 train_features_df=train_features_df.drop(['y'], axis=1)
 train_features_df.head()
 y = target.values
 print("printing feature column of train datasets: \n")
 print(train_features_df.values)
 le=LabelEncoder()
 X_train_label_encoded=train_features_df.apply(le.fit_transform)
 print("\n printing feature column of train datasets after label encoder: \n")
 print(X_train_label_encoded.head())

 print("printing test feature datasets: \n")
 print(test_features_df)
 X_test_label_encoded=test_features_df.apply(le.fit_transform)
 print("printing test feature encoded  datasets: \n")
 print(X_test_label_encoded)

The output of the code above is:

printing feature column of train datasets: 

[['k' 'v' 'at' ... 0 0 0]
 ['k' 't' 'av' ... 0 0 0]
 ['az' 'w' 'n' ... 0 0 0]

    X0  X1  X2  X3  X4  X5  X6  X8  X10  X12  ...  X375  X376  X377  X378  \
 0  32  23  17   0   3  24   9  14    0    0  ...     0     0     1     0   
 1  32  21  19   4   3  28  11  14    0    0  ...     1     0     0     0   
 2  20  24  34   2   3  27   9  23    0    0  ...     0     0     0     0

 printing test feature datasets: 

       X0  X1  X2 X3 X4  X5 X6 X8  X10  X12  ...  X375  X376  X377  X378  X379  \
 0     az   v   n  f  d   t  a  w    0    0  ...     0     0     0     1     0   
 1      t   b  ai  a  d   b  g  y    0    0  ...     0     0     1     0     0   
 2     az   v  as  f  d   a  j  j    0    0  ...     0     0     0     1     0

       X0  X1  X2  X3  X4  X5  X6  X8  X10  X12  ...  X375  X376  X377  X378  \
 0     21  23  34   5   3  26   0  22    0    0  ...     0     0     0     1   
 1     42   3   8   0   3   9   6  24    0    0  ...     0     0     1     0   
 2     21  23  17   5   3   0   9   9    0    0  ...     0     0     0     1   
 3     21  13  34   5   3  31  11  13    0    0  ...     0     0     0     1   
 4     45  20  17   2   3  30   8  12    0    0  ...     1     0     0     0

In the train DataFrame after label encoding, the value az in the first column was transformed to 20, while in the test DataFrame after label encoding the value az in the first column was transformed to 21.

Invictus

1 Answer

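The sets of unique values appearing in the training and test sets are probably different. LabelEncoder assigns integers according to the sorted order of the unique values it was fitted on, so fitting it separately on each set produces different encodings: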

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit_transform([1,2,3,4,5])
# array([0, 1, 2, 3, 4], dtype=int64)
le.fit_transform([2,3,4,5])
# array([0, 1, 2, 3], dtype=int64)

You should fit the encoder on the training data and then use it to transform the test data, so that both get the same encodings:

l_train = [1,2,3,4,5]
le.fit(l_train)
le.transform(l_train)
# array([0, 1, 2, 3, 4], dtype=int64)
le.transform([2,3,4,5])
# array([1, 2, 3, 4], dtype=int64)
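
For a DataFrame with several categorical columns, the same fit-on-train, transform-on-test idea can be applied one column at a time. Here is a minimal sketch, assuming two small made-up frames (train_df and test_df are hypothetical stand-ins for train_features_df and test_features_df) and a dict that keeps one fitted encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for train_features_df / test_features_df
train_df = pd.DataFrame({"X0": ["k", "az", "t"], "X1": ["v", "w", "b"]})
test_df = pd.DataFrame({"X0": ["az", "t"], "X1": ["v", "b"]})

encoders = {}  # one fitted LabelEncoder per column
train_enc = train_df.copy()
test_enc = test_df.copy()

for col in train_df.columns:
    le = LabelEncoder()
    # Learn the value -> integer mapping from the training column only
    train_enc[col] = le.fit_transform(train_df[col])
    # Reuse that same mapping on the test column
    test_enc[col] = le.transform(test_df[col])
    encoders[col] = le

print(train_enc)
print(test_enc)
```

With this, az gets the same code in both frames. Note that le.transform raises an error if the test column contains a value that was not present in the training column.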

Note, though, that you should not be using a label encoder for categorical features; see "LabelEncoder for categorical features?" for an explanation of why. LabelEncoder should only be used on the label, i.e. the target column. For features, you should be looking at OneHotEncoder, for instance.
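
As a rough illustration of the OneHotEncoder route, here is a minimal sketch, again assuming small made-up frames (train_df and test_df are hypothetical). It fits on the training data only and uses handle_unknown="ignore" so that a category seen only in the test data is encoded as all zeros for that feature instead of raising an error:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "zz" appears only in the test frame
train_df = pd.DataFrame({"X0": ["k", "az", "t"], "X1": ["v", "w", "b"]})
test_df = pd.DataFrame({"X0": ["az", "zz"], "X1": ["v", "b"]})

# handle_unknown="ignore" encodes unseen categories as all-zero columns
ohe = OneHotEncoder(handle_unknown="ignore")
X_train = ohe.fit_transform(train_df)  # fit on the training data only
X_test = ohe.transform(test_df)        # reuse the fitted categories

print(ohe.categories_)
print(X_test.toarray())
```

Unlike LabelEncoder, OneHotEncoder accepts a 2D input, so the whole feature frame can be passed at once. The result is a sparse matrix by default, hence the toarray() call for printing.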

desertnaut
yatu
  • In this case, how and where do I see a final dataframe with all columns encoded? I tried the code below and got the error "y should be a 1d array, got an array of shape (4209, 364) instead". Does it not take the entire dataframe? le=LabelEncoder() le.fit(train_features_df) le.transform(train_features_df) – Invictus Jul 31 '20 at 15:22
  • It expects a 1d array because, as I'm telling you, this is meant for the label column, i.e. a single column, not multiple features @invictus – yatu Jul 31 '20 at 15:51
  • How about if the test sub-sample contains a label that doesn't appear in the train sub-sample? For example: l_train = [1,2,3,4,5] and l_test=[2,3,4,5,6]. The method above will throw an error because 6 is not in the train sub-sample. How do you solve it? – Ernesto Lopez Fune Jun 24 '21 at 10:00