I'm training a model using XGBoost in Python. One of my variables is month (Jan, Feb, Mar, ...).

What is the correct way to encode it, and would performance or evaluation metrics differ between these two methods:

  1. One-hot encoding of the 12 months, giving 12 new variables that each take the value 0 or 1.
  2. One variable using the numeric values 1, 2, ..., 12 to represent the months.
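
For concreteness, the two options might look like this in pandas. This is only a minimal sketch: the column name `month` and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name "month" is an assumption.
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Dec", "Jan"]})

month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Method 1: one-hot encoding. Declaring all 12 categories up front
# yields 12 indicator columns even when some months are absent.
month_as_category = df["month"].astype(pd.CategoricalDtype(month_names))
one_hot = pd.get_dummies(month_as_category, prefix="month", dtype=int)

# Method 2: a single integer column, Jan -> 1, ..., Dec -> 12.
month_to_num = {name: i for i, name in enumerate(month_names, start=1)}
month_num = df["month"].map(month_to_num)

print(one_hot.shape)
print(month_num.tolist())
```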

I read this very similar question. However, the comments there support both methods, and there is no conclusive answer.

I know that with machine learning models in general you are not supposed to do (2), because the model will assume an ordinal relationship between the values. This raises another point of confusion for me: my variable, month, can be argued to be ordinal. For example, June (number 6) is later in the year than January (number 1); however, June is not of "greater importance" than January.

Please feel free to share your thoughts or any links to academic discussions on this. Thank you.

tpoh

1 Answer

Hot take: you can do both! Apply an ordinal encoding to your "name of month" column, then insert two copies of its output into your training data frame. Tell XGBoost to interpret one copy as continuous numeric and the other as categorical. XGBoost will then give a definitive answer to your question ("which encoding scheme is better"), since you can see which copy the trained model actually relies on.

Please note that for maximum effect you should enable XGBoost's native categorical feature support (XGBoost 1.6+, enable_categorical=True, etc.). The native encoding splits on groups of categories, whereas a split on an OHE column can only separate one category from all the others, so the native encoding is clearly superior to OHE.

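The correct representation of features such as DayOfMonth, WeekOfYear, MonthOfYear, etc. would be "cyclic+ordinal+numeric". A plain numeric column loses the wrap-around from December back to January, and a categorical column loses the ordering altogether. So, whatever you do currently, you're already distorting information.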
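
One common way to make the cyclic part explicit is to map the month onto a circle with a sine/cosine pair. This technique is not named above, so treat it as one possible illustration under that assumption rather than the definitive encoding; the sample frame and the column names `month_sin` and `month_cos` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: month as an integer, 1 = Jan, ..., 12 = Dec.
df = pd.DataFrame({"month": [1, 2, 6, 12]})

# Place each month on the unit circle, 12 evenly spaced positions.
angle = 2 * np.pi * (df["month"] - 1) / 12
df["month_sin"] = np.sin(angle)
df["month_cos"] = np.cos(angle)

def encoded_distance(frame: pd.DataFrame, a: int, b: int) -> float:
    """Euclidean distance between months a and b in (sin, cos) space."""
    pts = frame.set_index("month")[["month_sin", "month_cos"]]
    return float(np.linalg.norm(pts.loc[a].to_numpy() - pts.loc[b].to_numpy()))

# In this encoding Dec -> Jan is as close as Jan -> Feb,
# unlike the raw integers, where |12 - 1| = 11.
dec_jan = encoded_distance(df, 12, 1)
jan_feb = encoded_distance(df, 1, 2)
print(dec_jan, jan_feb)
```

Note that a tree splits on one column at a time, so whether this pair helps XGBoost in practice is an empirical question for your data.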

user1808924