You should encode the labels before the split, not after.
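As a minimal sketch of that order of operations (the label values below are made up for illustration, and only the 'Condition' column is modelled), the encoder is fitted once on the full column and the split happens afterwards:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical 'Condition' column standing in for the full, unsplit data.
condition = np.array(["Norm", "Artery", "Feedr", "Norm", "Feedr", "Norm"])

# Fit once on the whole column, so every class gets one fixed integer code.
encoder = LabelEncoder()
encoded = encoder.fit_transform(condition)

# Split afterwards; both parts share the same label-to-integer mapping.
train_codes, test_codes = train_test_split(encoded, test_size=2, random_state=0)

print(encoder.classes_)  # the classes in sorted order; a label's code is its index here
```

Because the mapping is fixed before the split, it does not matter which classes end up in which part.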
The unique labels (= classes) are sorted alphabetically, see uniques = sorted(set(values))
in this source code snippet from sklearn.preprocessing.LabelEncoder, which you can reach through the [source] link in the upper right of its documentation page.
python method:
    def _encode_python(values, uniques=None, encode=False):
        # only used in _encode below, see docstring there for details
        if uniques is None:
            uniques = sorted(set(values))
            uniques = np.array(uniques, dtype=values.dtype)
        if encode:
            table = {val: i for i, val in enumerate(uniques)}
            try:
                encoded = np.array([table[v] for v in values])
            except KeyError as e:
                raise ValueError("y contains previously unseen labels: %s"
                                 % str(e))
            return uniques, encoded
        else:
            return uniques
The same holds for numpy arrays as classes, see return np.unique(values), because np.unique() returns the unique values in sorted order:
numpy method:
    def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
        # only used in _encode below, see docstring there for details
        if uniques is None:
            if encode:
                uniques, encoded = np.unique(values, return_inverse=True)
                return uniques, encoded
            else:
                # unique sorts
                return np.unique(values)
        if encode:
            if check_unknown:
                diff = _encode_check_unknown(values, uniques)
                if diff:
                    raise ValueError("y contains previously unseen labels: %s"
                                     % str(diff))
            encoded = np.searchsorted(uniques, values)
            return uniques, encoded
        else:
            return uniques
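To see the sorting in isolation, here is a small sketch with made-up integer labels. It only calls np.unique and np.searchsorted, the two numpy functions used in the snippet above:

```python
import numpy as np

# Hypothetical numeric labels, deliberately unsorted.
values = np.array([30, 10, 20, 10])

# np.unique returns the sorted unique values; return_inverse gives each
# value's index into that sorted array, which is exactly the encoding.
uniques, encoded = np.unique(values, return_inverse=True)
print(uniques)  # [10 20 30]
print(encoded)  # [2 0 1 0]

# With already known (sorted) uniques, searchsorted yields the same codes.
print(np.searchsorted(uniques, values))  # [2 0 1 0]
```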
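You can never be sure that the test set and the training set contain exactly the same classes. Either set might simply lack one of the three classes of the label column 'Condition'. If each set is then encoded on its own, the same label can end up with a different integer code in each set.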
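A small sketch of what goes wrong, using a simplified stand-in for the pure-Python path quoted above (the helper name encode and the label values are hypothetical):

```python
def encode(values):
    # Same idea as the quoted code: sort the unique labels, then map label -> index.
    uniques = sorted(set(values))
    table = {val: i for i, val in enumerate(uniques)}
    return uniques, [table[v] for v in values]

# Hypothetical 'Condition' values; the test split happens to lack 'Artery'.
train = ["Norm", "Artery", "Feedr", "Norm"]
test = ["Norm", "Feedr", "Norm"]

train_classes, train_codes = encode(train)
test_classes, test_codes = encode(test)

print(train_classes, train_codes)  # here 'Norm' is encoded as 2
print(test_classes, test_codes)    # here 'Norm' is encoded as 1
```

The missing class shifts every label that sorts after it, so the codes of the two sets no longer mean the same thing.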
If you desperately want to encode after the train/test split, you need to check that both sets contain exactly the same classes before encoding.
Quoting the script:
Uses pure python method for object dtype, and numpy method for all
other dtypes.
python method (object type):
    assert sorted(set(train_home_data[att])) == sorted(set(test_home_data[att]))
numpy method (all other types):
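    assert np.array_equal(np.unique(train_home_data[att]), np.unique(test_home_data[att]))
(np.array_equal is needed here because == compares numpy arrays element by element and does not give a single truth value.)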
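If a bare assert is too terse, the check can be wrapped in a small helper that also reports which classes are missing. This is only a sketch: the name assert_same_classes is made up, and it assumes the labels of a column are mutually comparable (no mix of strings and numbers):

```python
import numpy as np

def assert_same_classes(train_values, test_values):
    """Raise ValueError unless both splits contain exactly the same labels."""
    train_classes = np.unique(train_values)
    test_classes = np.unique(test_values)
    if not np.array_equal(train_classes, test_classes):
        missing_in_test = np.setdiff1d(train_classes, test_classes)
        missing_in_train = np.setdiff1d(test_classes, train_classes)
        raise ValueError(
            f"class mismatch: missing in test {missing_in_test.tolist()}, "
            f"missing in train {missing_in_train.tolist()}"
        )

# Hypothetical 'Condition' values.
train = ["Norm", "Artery", "Feedr", "Norm"]
assert_same_classes(train, ["Feedr", "Artery", "Norm"])  # same classes, passes silently

try:
    assert_same_classes(train, ["Norm", "Feedr"])  # test split lacks 'Artery'
except ValueError as err:
    print(err)
```

With a data frame you would call it per column, for example assert_same_classes(train_home_data[att], test_home_data[att]).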