
Is it better to one-hot encode it or just leave it as a single numeric variable? I'm reading mixed conclusions on the net:

"Avoid OneHot for high cardinality columns and decision tree-based algorithms." https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159

as opposed to

"(onehotencoded) This is the proper representation of a categorical variable for xgboost or any other machine learning tool." XGBoost Categorical Variables: Dummification vs encoding

Niv
    "Hour of day" has 24 possible category levels, which probably isn't a "high cardinality" territory yet? The exact operational type of HoD - continuous, categorical or ordinal - is also dataset-dependent. – user1808924 Nov 19 '19 at 14:53
  • I like to keep the number of features small to reduce the likelihood of overfitting, adding 23 features feels like quite a lot to me. One hot encoding is typically not done for ordinal variables, i.e. if you one hot encode you will lose the info that 2pm is very close to 3 pm for example. I think you will certainly learn something from trying out both! – Anton Nov 19 '19 at 15:04
  • Thanks for the comments but I was looking for a more generic answer. From the links I posted It seems like there are two schools of thought each opposing the other. – Niv Nov 20 '19 at 08:28

1 Answer


There are more than 2 schools of thought :). In practice, there are pros and cons to everything, and the optimal approach will depend on your data. So the usual path forward is to try all feasible options and choose the one that suits your use case best (not only in terms of metrics, but also in terms of CPU/RAM, if the data are not tiny).

For example, OHE will add multiple columns, which can lead to a large memory footprint for long tables. At the same time, OHE loses ordinal information (if the feature was ordinal). This might not be a problem, as trees often pick up the relevant dependencies on the target on the fly. On the other hand, a simple ordered numeric representation of the hour keeps memory low and preserves the ordering of the values. But it has its own issues: it loses the information that hour 0 follows hour 23, it works with tree boosters in xgboost but not with the linear booster in xgboost or with model families outside of xgboost (linear models, SVM, etc.), and it is not theoretically sound for non-ordinal features (your question seemed general).
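To make that trade-off concrete, here is a minimal sketch of the two encodings in pandas (the `hour` column and names are illustrative, not from your data):

```python
import pandas as pd

# Toy frame with an "hour" feature (0-23); values are illustrative.
df = pd.DataFrame({"hour": [0, 5, 14, 23]})

# One-hot encoding: declaring all 24 categories up front gives 24 columns,
# but any notion of order between hours is gone.
hour_cat = df["hour"].astype(pd.CategoricalDtype(categories=range(24)))
ohe = pd.get_dummies(hour_cat, prefix="hour")

# Plain numeric encoding: a single ordered column,
# but the distance between hour 23 and hour 0 looks large.
numeric = df["hour"].astype("int64")
```

Note the trick of casting to a categorical dtype with all 24 levels first; otherwise `get_dummies` would only create columns for the hours present in the sample.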

Let me add a third school of thought that is applicable in this particular case: you can use cyclic encoding for features that have repetitive cycles (month of the year, hour of the day, etc.). The idea is to use sin and cos functions to encode each value with a fixed period (24 in the case of hour of the day). This keeps continuity at the edges (hour 23 ends up close to hour 0), keeps memory under control (only 2 features), and the number of encoded features does not depend on cardinality. There are many discussions that one can find by googling, for example this question: https://datascience.stackexchange.com/q/5990/53060. And there are many implementations on the web; I personally use this one in python: https://github.com/MaxHalford/xam/blob/master/docs/feature-extraction.md#cyclic-features. Of course, this does not apply to numerical categorical data in general, but it does apply to hour of the day specifically.
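A minimal sketch of the cyclic encoding idea (just the raw sin/cos transform, not the linked library):

```python
import numpy as np
import pandas as pd

# Hours 0-23; the period of the cycle is 24.
df = pd.DataFrame({"hour": np.arange(24)})

# Each hour maps to a point on the unit circle, so hour 23
# lands right next to hour 0 instead of far away from it.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```

In this representation the Euclidean distance between hours 23 and 0 is the same as between any other pair of adjacent hours, which is exactly the continuity the plain numeric encoding loses.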

But as said at the beginning, I personally would try all of them and see which fits the problem at hand best. Cyclic encoding may be the most conceptually sound for hour of the day, but it might perform worse than other approaches, and it would be meaningless for a feature like "age group".

Mischa Lisovyi