
I came across the article Label encoding across multiple columns in scikit-learn, and one of the answers (https://stackoverflow.com/a/30267328/10058906) explained that each value in a given column is encoded as an integer from 0 to (n-1), where n is the number of distinct values in that column. It raised a question: when I encode red: 2, orange: 1, and green: 0, does that imply green is closer to orange than to red, since 0 is closer to 1 than to 2, which in reality is not true? I initially thought that green gets the value 0 because it occurs the most times, but that does not hold for the fruit column, where apple gets the value 0 even though orange occurs the most times.
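For reference, here is a minimal reproduction of what I am seeing. The rows below are made up, but chosen so that green occurs most often in color and orange most often in fruit, as in my data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data with the two columns described above
df = pd.DataFrame({
    "color": ["green", "green", "red", "orange", "green"],
    "fruit": ["orange", "apple", "orange", "orange", "apple"],
})

for col in ["color", "fruit"]:
    le = LabelEncoder().fit(df[col])
    # classes_[i] is the value that gets encoded as the integer i
    print(col, le.classes_.tolist())

# color ['green', 'orange', 'red']   -> green: 0, orange: 1, red: 2
# fruit ['apple', 'orange']          -> apple: 0, orange: 1
```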

1 Answer


I would like to summarize Label Encoder and One Hot Encoding:

It is true that LabelEncoder simply assigns an integer to each distinct value in a column. This implies that if we label-encode the categorical values in the dataset above, the encoding would suggest that green is closer to orange than to red, since 0 is closer to 1 than to 2, which is false.
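A small sketch of that pitfall (the color values are just an assumed example):

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "orange", "green"]
codes = LabelEncoder().fit_transform(colors)
print(dict(zip(colors, codes.tolist())))   # {'red': 2, 'orange': 1, 'green': 0}

# Any model that treats these codes as numbers now "sees"
# |0 - 1| = 1 between green and orange but |0 - 2| = 2 between green and red,
# an ordering the colors do not actually have.
```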

On the other hand, one-hot encoding creates a separate column for each categorical value, and each row gets a 0 or 1 in that column representing the absence or presence of that value respectively. The built-in pandas function pd.get_dummies(dataframe) produces the same kind of output.
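A short sketch of both ways of one-hot encoding the same hypothetical color column (OneHotEncoder returns a sparse matrix by default, hence the .toarray()):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["green", "green", "red", "orange"]})

# scikit-learn: one binary column per distinct value
enc = OneHotEncoder()
print(enc.fit_transform(df[["color"]]).toarray())
# [[1. 0. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]   columns: color_green, color_orange, color_red

# pandas: the same idea via get_dummies
print(pd.get_dummies(df, columns=["color"], dtype=int))
```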

Hence, if the dataset contains categorical values that are ordinal in nature, label encoding is a reasonable choice; but if the data is nominal, one should go with one-hot encoding.
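As a sketch of the ordinal case (the size column and its ranking are assumed for illustration): when the categories have a real order, it is usually better to state that order explicitly rather than rely on the sorted order LabelEncoder would use, for example with scikit-learn's OrdinalEncoder:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Pass the ranking explicitly so small < medium < large maps to 0 < 1 < 2
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(enc.fit_transform(sizes))
# [[0.]
#  [2.]
#  [1.]
#  [0.]]
```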

https://discuss.analyticsvidhya.com/t/dummy-variables-is-necessary-to-standardize-them/66867/2