I am using sklearn for text classification; all my features are numerical, but my target labels are text. I understand the rationale for encoding features as numerics, but does the same apply to the target variable?
-
You identify each textual label with an integer that represents the class it belongs to. If that is not possible, then you are not doing classification. – Jan K May 06 '18 at 15:21
-
And what is the question? Scikit-learn handles the encoding of text targets on its own. Please explain in detail what you want. – Vivek Kumar May 07 '18 at 04:59
2 Answers
If your target variable is in textual form, you can transform it into numeric form (or you can leave it alone; please see my note below) so that any Scikit-learn algorithm can handle it in an OVA (One Versus All) scheme: the learning algorithm tries to guess each class against all the remaining ones, with the classes expressed as numeric codes from 0 to (number of classes - 1).
For instance, in this example from the Scikit-Learn documentation, you can figure out the class of your iris because there are three models that evaluate each possible class:
- class 0 versus classes 1 and 2
- class 1 versus classes 0 and 2
- class 2 versus classes 0 and 1
Naturally, classes 0, 1 and 2 are Setosa, Versicolor, and Virginica, but the algorithm needs them expressed as numeric codes, as you can verify by exploring the results of the example code:
>>> list(iris.target_names)
['setosa', 'versicolor', 'virginica']
>>> np.unique(Y)
array([0, 1, 2])
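If you prefer to make this mapping explicit, LabelEncoder produces exactly these codes and can reverse them. A minimal sketch (the string labels here are just the iris class names, used for illustration):

from sklearn.preprocessing import LabelEncoder

# Textual labels, as you might have them in your target column
y_text = ['setosa', 'versicolor', 'virginica', 'setosa', 'virginica']

# Each distinct string is mapped to an integer code 0..(n_classes - 1)
encoder = LabelEncoder().fit(y_text)
y_numeric = encoder.transform(y_text)
print(y_numeric)  # [0 1 2 0 2]

# inverse_transform recovers the original string labels from the codes
print(encoder.inverse_transform(y_numeric))
# ['setosa' 'versicolor' 'virginica' 'setosa' 'virginica']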
NOTE: it is true that Scikit-learn encodes the target labels by itself if they are strings. In Scikit-learn's source for logistic regression (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py), at rows 1623 and 1624 you can see where the code calls the label encoder and encodes the labels automatically:
# Encode for string labels
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
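You can check this behaviour directly: fitting an estimator on string labels works without any manual step, and predictions come back as strings. A short sketch using standard scikit-learn calls (max_iter=1000 is only there to avoid convergence warnings):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Rebuild the iris target as strings, like the question's textual labels
iris = load_iris()
X = iris.data
y_text = iris.target_names[iris.target]  # 'setosa', 'versicolor', ...

# The estimator encodes the string labels internally
clf = LogisticRegression(max_iter=1000).fit(X, y_text)
print(clf.classes_)        # ['setosa' 'versicolor' 'virginica']
print(clf.predict(X[:3]))  # predictions are returned as strings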

-
Thanks. I tried using SVM with and without numerical mapping; both seem to give the same result. – Nanda kumar May 06 '18 at 19:41
-
This was something that I didn't know of. I checked the code of the latest Scikit-learn package and found out that it calls LabelEncoder for any target variable you pass it. – Luca Massaron May 07 '18 at 20:14
-
I did not check the code, but I noticed that target variable encoding was necessary when I had installed via conda, while with a pip installation no encoding was needed. I initially ran the code on a pip installation and later installed conda for some reason; then I realised this and had to change the code! – Gana Jan 12 '19 at 09:12
-
While `.fit`, `.transform`, and `.predict` support text targets, some functions don't, e.g. `metrics.roc_auc_score`. In that case, `LabelEncoder` is necessary: `enc = preprocessing.LabelEncoder().fit(y_test); metrics.roc_auc_score(enc.transform(y_test), enc.transform(xgb4.predict(X_test)))` – Matt Harrison Feb 20 '19 at 21:51
I would say it is. Scikit-learn does automatic encoding for some of its training and prediction methods, as well as some scoring methods, but not for all. The source code for skl's _encode method is here. I don't know about other libraries, but they might not do this automatic encoding.
If you're working in a commercial environment, I think it is best to encode your labels up front, so you don't have to redo your pipeline mid-production.
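One way to wire that up, as a sketch with made-up data (the random features and SVC here are placeholders for your own arrays and model):

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Placeholder data: numeric features, textual binary labels
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y_text = rng.choice(['ham', 'spam'], size=100)

# Encode once, up front, and keep the encoder with your model artifacts
encoder = LabelEncoder().fit(y_text)
y = encoder.transform(y_text)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC().fit(X_train, y_train)

# Metrics such as roc_auc_score expect numeric labels, which you now have
print(roc_auc_score(y_test, clf.decision_function(X_test)))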
