I came across the post "Label encoding across multiple columns in scikit-learn", and one of its answers (https://stackoverflow.com/a/30267328/10058906) explained how each value in a given column is encoded as an integer in the range 0 to (n-1), where n is the number of distinct values in the column.
This raised a question: when I encode red: 2, orange: 1, and green: 0, does it imply that green is closer to orange than to red, since 0 is closer to 1 than to 2? In reality that is not true. I first thought that green gets the value 0 because it occurs the most often. But this does not hold for the fruit column, where apple gets the value 0 even though orange occurs the most often.

1 Answer
I would like to summarize Label Encoder and One Hot Encoding:
It is true that Label Encoder simply assigns an integer to each distinct cell value. For the above dataset, label encoding the categorical values would therefore imply that green is closer to orange than to red, since 0 is closer to 1 than to 2, which is false.
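As a quick check of how the integers are assigned, here is a minimal sketch using scikit-learn's LabelEncoder on two made-up columns (the values are assumptions for illustration, not the dataset from the question). The encoder's classes_ attribute holds the distinct values in sorted order, so the codes follow alphabetical order rather than frequency, which would explain why apple gets 0 even when orange is more common:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical columns for illustration, not the dataset from the question.
color = ["green", "red", "orange", "green", "green"]
fruit = ["orange", "apple", "orange", "pear", "orange"]

color_encoder = LabelEncoder().fit(color)
fruit_encoder = LabelEncoder().fit(fruit)

# classes_ lists the distinct values in sorted order; the position is the code.
color_codes = {str(label): code for code, label in enumerate(color_encoder.classes_)}
fruit_codes = {str(label): code for code, label in enumerate(fruit_encoder.classes_)}

print(color_codes)  # {'green': 0, 'orange': 1, 'red': 2}
print(fruit_codes)  # {'apple': 0, 'orange': 1, 'pear': 2}
```

Note that orange is the most frequent fruit here, yet apple still gets 0 because it sorts first.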
On the other hand, One Hot Encoding creates a separate column for each categorical value, and each column holds either 0 or 1 to represent the absence or presence of that value, respectively. The built-in pandas function pd.get_dummies(dataframe) produces the same kind of output.
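To make the one-hot layout concrete, here is a small sketch with pd.get_dummies on a made-up color column (the DataFrame and its values are assumptions for illustration). Each distinct value becomes its own 0/1 column, so no color ends up numerically "closer" to another:

```python
import pandas as pd

# Hypothetical data for illustration, not the dataset from the question.
df = pd.DataFrame({"color": ["green", "red", "orange", "green"]})

# One indicator column per distinct value; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df, columns=["color"], dtype=int)

print(dummies)
```

Every row has exactly one 1 across the color_green, color_orange, and color_red columns.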
Hence, if the given dataset contains categorical values that are ordinal in nature, it is wise to use Label Encoding; but if the given data is nominal, one should go with One Hot Encoding.
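One caveat for the ordinal case, sketched below in plain Python under assumed data: codes assigned in sorted (alphabetical) order do not necessarily match the real ranking, so it may be safer to spell the order out explicitly. The size column and the names size_order and size_to_code are hypothetical, introduced only for this example:

```python
# Hypothetical ordinal column for illustration.
sizes = ["medium", "small", "large", "small"]

# Explicit ranking chosen by hand: small < medium < large.
size_order = ["small", "medium", "large"]
size_to_code = {label: code for code, label in enumerate(size_order)}
encoded = [size_to_code[s] for s in sizes]

# For comparison: codes assigned by sorted order of the distinct values.
alphabetical = {label: code for code, label in enumerate(sorted(set(sizes)))}

print(encoded)       # [1, 0, 2, 0]
print(alphabetical)  # {'large': 0, 'medium': 1, 'small': 2}
```

With the explicit mapping the integers preserve the intended order, while the alphabetical codes would place large below small.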
https://discuss.analyticsvidhya.com/t/dummy-variables-is-necessary-to-standardize-them/66867/2
