
I want to know how LabelEncoder() works. This is a part of my code:

from sklearn.preprocessing import LabelEncoder

for att in all_features_test:
    if str(test_home_data[att].dtypes) == 'object':
        # categorical column: fill missing values, then label-encode
        test_home_data[att].fillna('Nothing', inplace=True)
        train_home_data[att].fillna('Nothing', inplace=True)

        train_home_data[att] = LabelEncoder().fit_transform(train_home_data[att])
        test_home_data[att] = LabelEncoder().fit_transform(test_home_data[att])
    else:
        # numerical column: fill missing values with 0
        test_home_data[att].fillna(0, inplace=True)
        train_home_data[att].fillna(0, inplace=True)

Both the train and test data sets have an attribute 'Condition' which can hold the values Bad, Average, and Good.

Let's say LabelEncoder() would encode Bad as 0, Average as 2, and Good as 1 in train_home_data. Would that be the same for test_home_data?

If not, then what should I do?

  • Does this answer your question? [How to prevent LabelEncoder from sorting label values?](https://stackoverflow.com/questions/58893912/how-to-prevent-labelencoder-from-sorting-label-values) – questionto42 Oct 12 '20 at 14:13
  • The idea is that you set (-> fit) the encoder once, for example on the training dataset, and then apply it (without re-fitting or changing it) to the test dataset. – felice Oct 15 '20 at 07:44
  • @felice Thank you, I had not got that. Then again, the same issue, you must be sure that you cover all possible attributes of a dimension in both datasets. And there is no guarantee for that, you need to check that both datasets have the same unique attributes to be encoded, else the encoder might find an attribute in the testing set that is not known. – questionto42 Oct 16 '20 at 09:46
  • You can easily remove all datapoints from your test set which have labels that are not available in the training set - since you will not be able to classify them anyhow. That should solve the problem. – felice Oct 17 '20 at 11:02
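A minimal sketch of what the comments above suggest, assuming the train_home_data / test_home_data frames and the 'Condition' column from the question (everything else is illustrative): fit the encoder once on the training column, drop test rows whose labels the encoder has not seen, and only then transform the test column.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train_home_data['Condition'] = le.fit_transform(train_home_data['Condition'])

# keep only test rows whose labels were seen during fitting;
# le.transform() would otherwise raise "y contains previously unseen labels"
known = test_home_data['Condition'].isin(le.classes_)
test_home_data = test_home_data[known]
test_home_data['Condition'] = le.transform(test_home_data['Condition'])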

3 Answers


You should not label-encode after the split, but before.

The unique labels (= classes) are ordered alphabetically, see uniques = sorted(set(values)) in this source code snippet from sklearn.preprocessing.LabelEncoder (linked as [source] at the upper right of its documentation page).

python method:

def _encode_python(values, uniques=None, encode=False):
    # only used in _encode below, see docstring there for details
    if uniques is None:
        uniques = sorted(set(values))
        uniques = np.array(uniques, dtype=values.dtype)
    if encode:
        table = {val: i for i, val in enumerate(uniques)}
        try:
            encoded = np.array([table[v] for v in values])
        except KeyError as e:
            raise ValueError("y contains previously unseen labels: %s"
                             % str(e))
        return uniques, encoded
    else:
        return uniques

The same holds when the classes are numpy arrays, see return np.unique(values), because np.unique() sorts by default:

numpy method:

def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
    # only used in _encode below, see docstring there for details
    if uniques is None:
        if encode:
            uniques, encoded = np.unique(values, return_inverse=True)
            return uniques, encoded
        else:
            # unique sorts
            return np.unique(values)
    if encode:
        if check_unknown:
            diff = _encode_check_unknown(values, uniques)
            if diff:
                raise ValueError("y contains previously unseen labels: %s"
                                 % str(diff))
        encoded = np.searchsorted(uniques, values)
        return uniques, encoded
    else:
        return uniques
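To see this in practice, here is a small demonstration (the class names Bad, Average, Good come from the question; 'Above average' is a hypothetical addition): the fitted classes are sorted alphabetically, so the integer codes depend on which classes are actually present.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Bad', 'Average', 'Good'])
print(le.classes_)                               # ['Average' 'Bad' 'Good']
print(le.transform(['Average', 'Bad', 'Good']))  # [0 1 2]

# one additional class changes the alphabetical order and shifts the codes
le2 = LabelEncoder()
le2.fit(['Bad', 'Average', 'Good', 'Above average'])
print(le2.classes_)                              # ['Above average' 'Average' 'Bad' 'Good']
print(le2.transform(['Average', 'Bad', 'Good'])) # [1 2 3]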

You can never be sure that the test set and training set have exactly the same classes. The training or testing set might simply lack one of the three classes of the label column 'Condition'.

If you desperately want to encode after the train/test split, you need to check that the classes are the same in both sets before the encoding.

Quoting the script:

Uses pure python method for object dtype, and numpy method for all other dtypes.

python method (object type):

assert sorted(set(train_home_data[att])) == sorted(set(test_home_data[att]))

numpy method (all other types):

assert np.array_equal(np.unique(train_home_data[att]), np.unique(test_home_data[att]))
questionto42
  • 7,175
  • 4
  • 57
  • 90
  • train_home_data and test_home_data are two different data sets. The attributes of the test data set are a subset of those of the train data set. – Chaitanya Thombare Oct 13 '20 at 08:28
  • @ChaitanyaThombare Yes they are two different datasets. But the test set is usually not a subset of the train dataset, train and test set are split from the original dataset. If you really take the test set just as the subset of the train set, your test set is not valid. And I do not understand how the attributes can be a subset of the train data set. If you mean that you can guarantee that both datasets have the same attributes, then you can be sure that the encoding will be the same in train and test set labels. – questionto42 Oct 13 '20 at 11:09
  • Genuine apology for being unclear. The case is that the train data set has 85 attributes and the test data set has 68 attributes. All attributes in the test data set exist in the train data set. Both of these data sets are provided via two separate csv files. – Chaitanya Thombare Oct 13 '20 at 16:52
  • What do you mean by attributes. Do you mean the labels or the classes? I guess you mean the labels, since you will probably not have 68 classes. And then again, it simply must be guaranteed in both datasets that their labels cover the same classes. If the training set has one class in addition to [Average, Bad, Good] = [0,1,2], for example "Above average", the alphabetical order would be [Average, Above average, Bad, Good], and then the encoding [0,1,2,3] would differ. So you must make sure that the exactly same classes are covered in both sets, therefore the assert statements above. – questionto42 Oct 13 '20 at 19:10
  • I see that I have misunderstood you here, you really mean 68 or 85 attribute columns, not the label column that needs to be encoded. Now I also see your point with the subset. So you test your model with less attribute columns than it was trained? Does that work at all? The script should normally crash then, telling you something like "sizes / dimensions are not the same" or the like. If the attribute columns in the test set have the same content as the ones in the train set, the encoding will give you the exactly same output. Again, this makes no sense for a model, though. – questionto42 Oct 14 '20 at 08:45
  • I create a list of common columns of both data sets and work with the columns in that list, so my model works for sure. Never mind, I have got my answer. – Chaitanya Thombare Oct 14 '20 at 20:18
  • If you work with columns in the list, your list simply seems to cover the feature columns and perhaps the label column of the model. And whatever (feature or label) column you encode, you will have to make sure that the original classes should be the same in training and testing set, else the encoding will differ. It is not clear to me what gave you the answer now. If it is the answer above, please accept it. If it is coming from a comment, please make this clear and I will change the answer. If it is something you found out on your own, please answer yourself and accept. Thank you. – questionto42 Oct 15 '20 at 07:30

I got the answer for this I guess.

Code

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data1 = [('A', 1), ('B', 2), ('C', 3), ('D', 4)]
data2 = [('D', 1), ('A', 2), ('A', 3), ('B', 4)]

df1 = pd.DataFrame(data1, columns = ['col1', 'col2'])
df2 = pd.DataFrame(data2, columns = ['col1', 'col2'])

print(df1['col1'])
print(df2['col1'])

df1['col1'] = LabelEncoder().fit_transform(df1['col1'])
df2['col1'] = LabelEncoder().fit_transform(df2['col1'])

print(df1['col1'])
print(df2['col1'])

Output

0    A
1    B
2    C
3    D
Name: col1, dtype: object # df1
0    D
1    A
2    A
3    B
Name: col1, dtype: object # df2
0    0
1    1
2    2
3    3
Name: col1, dtype: int64 # df1 encoded
0    2
1    0
2    0
3    1
Name: col1, dtype: int64 # df2 encoded

B of df1 is encoded to 1,

and

B of df2 is encoded to 1 as well.

So if I encode the training and testing data sets separately, the encoded values in the training set are reflected in the testing data set (as long as both are label-encoded).
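A small counter-example (a hypothetical variation of df2 in which no 'B' occurs) suggests this only works out because both columns happen to contain the same set of classes; with one class missing, the codes shift:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical variation of df2 without a 'B' row
df3 = pd.DataFrame([('D', 1), ('A', 2), ('A', 3), ('C', 4)],
                   columns=['col1', 'col2'])
print(LabelEncoder().fit_transform(df3['col1']))
# [2 0 0 1] -> 'C' is now encoded as 1, while in df1 it was encoded as 2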

  • But you are fitting two different LabelEncoders on two different datasets `df1`, `df2`. How do you make sure, the order stays the same? My suggestion: `.fit_transform()` on df1, `.transform()` on df2. – felice Oct 19 '20 at 07:20
  • As @felice suggested, I tried the `fit_transform()` method on df1 and the `transform()` method on df2. I get an error: "This LabelEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator." – Chaitanya Thombare Oct 20 '20 at 08:36
  • I have added a solution which explains what I mean. Hope it helps. – felice Oct 20 '20 at 14:43

I would suggest fitting the label encoder on one dataset and transforming both:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data1 = [('A', 1), ('B', 2), ('C', 3), ('D', 4)]
data2 = [('D', 1), ('A', 2), ('A', 3), ('B', 4)]

df1 = pd.DataFrame(data1, columns = ['col1', 'col2'])
df2 = pd.DataFrame(data2, columns = ['col1', 'col2'])

# here comes the new code:
le = LabelEncoder()
df1['col1'] = le.fit_transform(df1['col1'])
df2['col1'] = le.transform(df2['col1'])
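Note that le.transform(df2['col1']) only works here because every value in df2['col1'] already occurred in df1['col1'] when the encoder was fitted; an unseen value would raise the "y contains previously unseen labels" ValueError shown in the source snippet quoted in the first answer.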
felice
  • 1,185
  • 1
  • 13
  • 27
  • Okay, now I see what you meant. My issue now is that if I put `LabelEncoder()` instead of `le`, it won't work. How is it different? – Chaitanya Thombare Oct 20 '20 at 19:11
  • With `LabelEncoder()`, you are initializing a new label encoder every time. By using the same one, in that case, `le`, you solve that problem. Please mark this answer as the correct answer if you feel it is. – felice Oct 21 '20 at 20:47