5

I'm a total novice at scikit-learn.

I want to know whether, when I encode a categorical feature in the test dataset, I should reuse the same LabelEncoder instance that I used on the training dataset. In other words, something like this:

from sklearn import preprocessing

# training data label encoding
le_blood_type = preprocessing.LabelEncoder()
df_training[ 'BLOOD_TYPE' ] = le_blood_type.fit_transform( df_training[ 'BLOOD_TYPE' ] )    # labeling from string
....
1. Using same label encoder
   df_test[ 'BLOOD_TYPE' ] = le_blood_type.fit_transform( df_test[ 'BLOOD_TYPE' ] )

2. Using different label encoder
   le_for_test_blood_type = preprocessing.LabelEncoder()
   df_test[ 'BLOOD_TYPE' ] = le_for_test_blood_type.fit_transform( df_test[ 'BLOOD_TYPE' ] )

Which one is the right code? Or does it make no difference which one I choose, since the categorical values in the training dataset and the test dataset should end up encoded the same way?

mac475
  • If you want to do the `fit_transform()` in a programme and to do the `transform()` in another programme please check this answer https://stackoverflow.com/questions/28656736/using-scikits-labelencoder-correctly-across-multiple-programs/55895639#55895639 – Shady Mohamed Sherif Apr 29 '19 at 00:26

2 Answers

9

The problem is in fact the way you use it.

LabelEncoder maps each nominal value to an integer code, so you should call fit once and then only call transform on the fitted object. Don't forget that every nominal value you will need to encode must be present when you fit, during the training phase.

So the right way to use it is to fit it once on your nominal feature, then use only the transform method from then on.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) 
array([0, 0, 1, 2]...)

Example taken from the official documentation.
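Applied to the BLOOD_TYPE column from the question, the same pattern could look like the sketch below. The two small data frames are made-up stand-ins for df_training and df_test, not the asker's real data. The last lines show why option 1 and option 2 in the question are both problematic: each refits on the test data, which can assign different integers to the same category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for df_training / df_test.
df_training = pd.DataFrame({"BLOOD_TYPE": ["A", "B", "O", "AB", "A"]})
df_test = pd.DataFrame({"BLOOD_TYPE": ["O", "A", "O"]})

le_blood_type = LabelEncoder()

# Fit on the training column only; this fixes the category -> integer mapping.
train_codes = le_blood_type.fit_transform(df_training["BLOOD_TYPE"])

# Reuse the fitted encoder on the test column: transform, not fit_transform.
test_codes = le_blood_type.transform(df_test["BLOOD_TYPE"])

# For comparison: refitting on the test column alone builds a new mapping
# from only the categories present there ("A" and "O").
refit_codes = LabelEncoder().fit_transform(df_test["BLOOD_TYPE"])

print(list(le_blood_type.classes_))  # ['A', 'AB', 'B', 'O']
print(list(test_codes))              # [3, 0, 3]  consistent with training
print(list(refit_codes))             # [1, 0, 1]  "O" is now 1 instead of 3
```

With the reused encoder, "O" is encoded as 3 in both datasets. After refitting on the test data it becomes 1, so a model trained on the first mapping would misread it.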

RPresle
  • Thanks for your answer. So you mean the LabelEncoder should be fitted only once, on the training dataset's categorical feature, and it should see all the nominal values. If so, can it then be used on the test dataset? – mac475 Jul 01 '15 at 12:03
  • Once your LabelEncoder is fitted with all your nominal values, you can use it with the transform method on every dataset you want – RPresle Jul 01 '15 at 12:33
  • OK, clearly understood. Thank you for your kind and quick answer. – mac475 Jul 01 '15 at 12:51
  • What if I have two different dataframes for test and train data? I have two different CSV files which I have loaded into two different dataframes. Does that mean I will have to merge them, perform the label encoding, and then split them again? – Invictus Jul 31 '20 at 10:24
6

I think RPresle has already given the answer. I just wanted to put it a little more directly in terms of the situation in the question:

In general, you just need to fit the LabelEncoder once (on the feature in the training set) and then use it to transform the feature in the testing set. But if your testing set has feature values that are not in the training set, fit the label encoder on the union of the training and testing feature values.
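A minimal sketch of the union approach, using made-up values in which "AB" appears only in the test data. It first shows that an encoder fitted on the training values alone raises ValueError when asked to transform an unseen label, then fits on the concatenated values instead. The two Series are hypothetical; with real data they would be the relevant column of each dataframe, and the dataframes themselves do not need to be merged.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values; "AB" never appears in the training data.
train_values = pd.Series(["A", "B", "O"])
test_values = pd.Series(["A", "AB"])

# An encoder fitted on training values only cannot transform an unseen label.
le_train_only = LabelEncoder().fit(train_values)
try:
    le_train_only.transform(test_values)
    unseen_error = False
except ValueError:
    unseen_error = True

# Fit once on the union of both columns, then transform each separately.
le_union = LabelEncoder().fit(pd.concat([train_values, test_values]))
train_codes = le_union.transform(train_values)
test_codes = le_union.transform(test_values)

print(unseen_error)              # True
print(list(le_union.classes_))   # ['A', 'AB', 'B', 'O']
print(list(train_codes))         # [0, 2, 3]
print(list(test_codes))          # [0, 1]
```

Only the column values are concatenated for fitting, so the training and test dataframes can stay separate throughout.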

Undecided