I'm training a model using XGBoost in Python. One of my variables is month (Jan, Feb, Mar, ...).
What is the correct way to encode this variable, and would there be any difference in performance or evaluation metrics between the following two methods?
1. One-hot encoding of the 12 months, giving 12 new variables that each take the value 0 or 1.
2. A single variable using the numeric values 1, 2, ..., 12 to represent the months.
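
To make the two options concrete, here is a minimal sketch of how I understand them. It assumes pandas and a toy DataFrame with a hypothetical `month` column; the data and the names `MONTHS`, `one_hot` and `month_num` are illustrative only, not taken from my real dataset.

```python
import pandas as pd

# Calendar order of the month labels.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical toy data; the column name "month" is an assumption.
df = pd.DataFrame({"month": ["Jan", "Jun", "Dec", "Feb"]})

# Declare all 12 categories in calendar order so both encodings
# are stable even when some months are missing from the data.
month_cat = df["month"].astype(pd.CategoricalDtype(categories=MONTHS))

# Method 1: one-hot encoding, 12 indicator columns holding 0 or 1.
one_hot = pd.get_dummies(month_cat, prefix="month", dtype=int)

# Method 2: a single integer column, 1 = Jan, ..., 12 = Dec.
month_num = (month_cat.cat.codes + 1).rename("month_num")

print(one_hot.head())
print(month_num.head())
```

Either `one_hot` or `month_num` would then be joined with the other features before being passed to XGBoost.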
I read this very similar question. However, there are comments supporting each method and no conclusive answer.
I know that for machine learning models in general you are not supposed to use (2), the single numeric variable, because the model will assume an ordinal relationship between the values. This raises another point of confusion for me: my variable, month, can be argued to be ordinal. For example, June (number 6) is later in the year than January (number 1); however, June is not of "greater importance" than January.
Please feel free to share your thoughts or any links to academic discussions on this. Thank you.