
I wanted to know the difference between sklearn's `LabelEncoder` and pandas' `get_dummies`. Why would one choose `LabelEncoder` over `get_dummies`? What are the advantages and disadvantages of each?

As far as I understand, if I have a class A

ClassA = ["Apple", "Ball", "Cat"]
encoder = [1, 2, 3]

and

dummy = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]

Am I understanding this incorrectly?

    The equivalent of `get_dummies` is [`OneHotEncoder`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) by the way. – ayhan Jul 16 '16 at 17:35

2 Answers


These are just convenience functions falling naturally into the way these two libraries tend to do things, respectively. The first one "condenses" the information by changing things to integers, and the second one "expands" the dimensions allowing (possibly) more convenient access.


`sklearn.preprocessing.LabelEncoder` simply transforms data, from whatever domain, so that its domain is 0, ..., k - 1, where k is the number of classes.

So, for example

["paris", "paris", "tokyo", "amsterdam"]

could become (note that `LabelEncoder` assigns codes in sorted class order, so `"amsterdam"` gets 0):

[1, 1, 2, 0]
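A minimal sketch of this (assuming scikit-learn is installed; `LabelEncoder` assigns codes by sorted class order, not order of appearance):

```python
from sklearn.preprocessing import LabelEncoder

cities = ["paris", "paris", "tokyo", "amsterdam"]

le = LabelEncoder()
codes = le.fit_transform(cities)

# Classes are stored sorted alphabetically: amsterdam=0, paris=1, tokyo=2
print(le.classes_.tolist())  # ['amsterdam', 'paris', 'tokyo']
print(codes.tolist())        # [1, 1, 2, 0]

# The mapping is invertible
print(le.inverse_transform(codes).tolist())  # ['paris', 'paris', 'tokyo', 'amsterdam']
```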

`pandas.get_dummies` also takes a Series with elements from some domain, but expands it into a DataFrame whose columns correspond to the distinct values in the series, and whose entries are 0 or 1 depending on what the original value was. So, for example, the same

["paris", "paris", "tokyo", "amsterdam"]

would become a DataFrame with columns (sorted alphabetically)

["amsterdam", "paris", "tokyo"]

and whose "paris" entry would be the series

[1, 1, 0, 0]
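A short sketch (assuming pandas is installed; `dtype=int` is passed because recent pandas versions default to boolean dummy columns):

```python
import pandas as pd

s = pd.Series(["paris", "paris", "tokyo", "amsterdam"])

# One column per distinct value, filled with 0/1 indicators
dummies = pd.get_dummies(s, dtype=int)

print(list(dummies.columns))      # ['amsterdam', 'paris', 'tokyo']
print(dummies["paris"].tolist())  # [1, 1, 0, 0]
```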

The main advantage of the first method is that it conserves space. Conversely, encoding things as integers might give the impression (to you, or to some machine learning algorithm) that the order means something. Is "amsterdam" closer to "tokyo" than to "paris" just because of the integer encoding? Probably not. The second representation is a bit clearer on that point.

Ami Tavory
  • Thank you for the clarification. If you were to work on a classification problem, would you use `get_dummies` on your response variable as well, or is it better to use `LabelEncoder`? – Sam Jul 16 '16 at 19:07
  • 3
    As a rule of thumb, if the classes didn't have a natural order, then dummy variables, but the major consideration is what your algorithm implementation is expecting, though. You might want to open a different question describing a bit the problem, and which specific classification you're planning on using (preferably even specifying a specific function in a library). – Ami Tavory Jul 16 '16 at 19:32
  • Also, I think if we have a large number of categorical classes and we want better performance, then we should use label encoding. – Rahul Ranjan Apr 02 '21 at 18:32
  • 1
    Encoding variables as integers only matters if you use regression. In classification, we use methods that are suited for qualitative/categorical response values to make the prediction, hence the 'distance' between the encoding does not really matter. (Source: [Introduction to Statistical Learning](https://www.statlearning.com/), chapter 4, section 4.2) – user42 Apr 17 '21 at 07:46

`pandas.get_dummies` is one-hot encoding, while `sklearn.preprocessing.LabelEncoder` is incremental encoding, such as 0, 1, 2, 3, ...

One-hot encoding is more suitable for machine learning because the labels are independent of each other; e.g. 2 doesn't mean twice the value of 1.

If the training set and test set have a different number of classes for the same feature, please refer to "Keep same dummy variable in training and testing data" for two solutions.

Yuchao Jiang