62

When using XGBoost we need to convert categorical variables into numeric.

Would there be any difference in performance/evaluation metrics between the methods of:

  1. dummifying your categorical variables
  2. encoding your categorical variables from e.g. (a,b,c) to (1,2,3)

ALSO:

Would there be any reason not to go with method 2, using for example LabelEncoder?

abhiieor
ishido
  • 2
    *"When using XGBoost we need to convert categorical variables into numeric."* Not always, no. If `booster=='gbtree'` (the default), then **XGBoost can handle categorical variables encoded as numeric directly**, without needing dummifying/one-hotting. Whereas if the label is a string (not an integer) then yes we need to comvert it. – smci Nov 13 '19 at 00:31
  • 6
    @smci Although this is true, I believe the numeric relationship is preserved. Therefore, in an example where 1 = Texas and 2 = New York, New York would be "greater", which is not correct. – msarafzadeh Jul 08 '20 at 11:42

4 Answers

73

xgboost only deals with numeric columns.

Say you have a feature [a,b,b,c] which describes a categorical variable (i.e. one with no numeric relationship between its levels).

Using LabelEncoder you will simply have this:

array([0, 1, 1, 2])

LabelEncoder just maps each string ('a','b','c') to an integer, nothing more, and Xgboost will wrongly interpret this feature as having a numeric relationship!
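
For instance, a minimal sketch of how that encoding is produced (assuming scikit-learn is available):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['a', 'b', 'b', 'c'])
print(encoded)  # [0 1 1 2] -- one arbitrary integer per level, no real ordering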

Proper way

Using OneHotEncoder you will eventually get to this:

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

This is the proper representation of a categorical variable for xgboost or any other machine learning tool.

Pandas `get_dummies` is a nice tool for creating dummy variables (and is easier to use, in my opinion).
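
For example, a small sketch of the same feature with `pd.get_dummies` (assuming pandas is available):

import pandas as pd

feature = pd.Series(['a', 'b', 'b', 'c'])
dummies = pd.get_dummies(feature, prefix='feature')
print(dummies)
# one indicator column per level (feature_a, feature_b, feature_c),
# matching the OneHotEncoder matrix shown above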

Method #2 in the above question will not represent the data properly.

T. Scharf
  • 8
    Won't this make features with many categories appear more important than ones with fewer? – Simd Apr 29 '16 at 04:47
  • 1
    How does `Xgboost` know to treat `array([1., 0., ...])` as categorical instead of numeric? – Thiago Jun 20 '16 at 19:44
  • @ThiagoBalbo Simply put: it does not. You just replace the original variable/feature/column with 3 binary variables/features/columns. – masu Aug 04 '16 at 11:56
  • 41
    Assuming that we are talking about using Xgboost for GBDT and not a linear model, this answer is simply not true. Encoding a categorical variable with integers works for xgboost and sometimes (YMMV) outperforms a one-hot encoding. – B_Miner Aug 15 '16 at 16:42
  • 14
    To the people claiming that a tree-based split algorithm can tease out categoricals encoded as numeric: they need to understand that xgboost uses a gradient-based split criterion, so the numeric relationship is preserved, unlike entropy-based criteria, where numeric encoding can succeed a bit more easily. This can be empirically verified with toy datasets. If you have large numbers of categories, of course one-hotting is a bad strategy. – T. Scharf Feb 26 '17 at 21:15
  • 4
    @B_Miner can you explain this further? I really want to know how my model using LabelEncoding is actually performing better compared to one-hot encoding the categorical features. It just doesn't seem right - how do we explain this behavior to the business? My category has around ~3000 distinct values, so one-hot encoding was also bloating the dataset. – user3000805 Aug 17 '17 at 13:30
  • 5
    *"...or any other machine learning tool"* I don't know about xgboost, but in general this is fundamentally not true, many machine learning tools handle categorical variables directly and using OHE or dummy variables seriously degrades the performance: https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/. Furthermore, there are many many other encoding schemes for categorical variables, and the best encoding will depend on your model as well as your data. – Dan Feb 03 '18 at 01:33
  • 1
    This answer is simply wrong. The "proper way" is not proper, it is just a way that happens to work. – Marcin Sep 20 '18 at 10:43
  • While corresponding with the development team of scikit-learn about an issue, I was told that one should use `OrdinalEncoder` instead of `LabelEncoder`. It follows from this that `OneHotEncoder` is not the only "right way" because I would have been told so otherwise. – readyready15728 Nov 20 '18 at 21:56
  • 1
    Great comment.. here's the thing: `Cat, Dog, Tree` maps to `1,2,3` using the 2 methods you mention. A tree-based method can dissect (split) those apart. A linear model would wrongly interpret a numeric relationship (and would fail miserably). Hence, one-hot encoding is the default 'best way' to represent a categorical (i.e. no numeric relationship) without any other knowledge -- obviously there are infinite ways to represent a categorical, some better than others.. – T. Scharf Nov 21 '18 at 16:40
19

I want to answer this question not just in terms of XGBoost but in terms of any problem dealing with categorical data. While "dummification" creates a very sparse setup, especially if you have multiple categorical columns with different levels, label encoding is often biased as the mathematical representation is not reflective of the relationship between levels.

For binary classification problems, a genius yet under-explored approach which is highly leveraged in traditional credit scoring models is to use Weight of Evidence (WoE) to replace the categorical levels. Basically, every categorical level is replaced by (the log of) the proportion of Goods over the proportion of Bads.
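
As a rough illustration of the idea (a hedged sketch, not the linked library's API; `woe_encode`, `df`, `cat_col`, and `target` are placeholder names):

import numpy as np
import pandas as pd

def woe_encode(df, cat_col, target):
    # target is assumed to be 1 for "good" and 0 for "bad"
    goods = df[df[target] == 1].groupby(cat_col).size() / (df[target] == 1).sum()
    bads = df[df[target] == 0].groupby(cat_col).size() / (df[target] == 0).sum()
    woe = np.log(goods / bads)   # weight of evidence per categorical level
    # (levels missing from one class give inf/NaN; in practice a small smoothing term is added)
    return df[cat_col].map(woe)  # replace each level with its WoE value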

Can read more about it here.

Python library here.

This method allows you to capture the "levels" in a single column and avoids the sparsity or induced bias that would occur through dummifying or integer encoding.

Hope this helps !

mamafoku
  • 1
    This seems like a great way to add a new feature to replace the need for the problematic categorical variable. It still does not contain the same information that the original column had, but adding enough columns of this kind could do the trick. – Heikki Pulkkinen Oct 10 '18 at 07:52
  • This seems very similar to the more general target encoding and variants thereof (see [this](http://contrib.scikit-learn.org/categorical-encoding/index.html) for example) – jerorx Nov 18 '18 at 17:28
8

Nov 23, 2020

Since version 1.3.0, XGBoost has added experimental support for categorical features. From the docs:

1.8.7 Categorical Data

Other than users performing encoding, XGBoost has experimental support for categorical data using gpu_hist and gpu_predictor. No special operation needs to be done on input test data since the information about categories is encoded into the model during training.

https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf

In the DMatrix section the docs also say:

enable_categorical (boolean, optional) – New in version 1.3.0.

Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it’s only available for gpu_hist tree method with 1 vs rest (one hot) categorical split. Also, JSON serialization format, gpu_predictor and pandas input are required.
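
A minimal sketch of what this might look like in practice, based on the quoted docs (the DataFrame and column names here are made up, and it requires an XGBoost >= 1.3 build with GPU support):

import pandas as pd
import xgboost as xgb

# Categorical columns must use the pandas 'category' dtype
X = pd.DataFrame({
    "city": pd.Series(["NY", "TX", "NY", "CA"], dtype="category"),
    "age": [25, 32, 47, 51],
})
y = [1, 0, 0, 1]

dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)
params = {"tree_method": "gpu_hist", "predictor": "gpu_predictor"}
booster = xgb.train(params, dtrain, num_boost_round=10)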

Jonatan
2

Here is a code example of adding one-hot encoded columns to a Pandas DataFrame with categorical columns:

import pandas as pd

# df is assumed to be an existing DataFrame that contains the categorical columns
ONE_HOT_COLS = ["categorical_col1", "categorical_col2", "categorical_col3"]
print("Starting DF shape: %d, %d" % df.shape)

for col in ONE_HOT_COLS:
    s = df[col].unique()

    # Create a One Hot DataFrame with 1 row for each unique value
    one_hot_df = pd.get_dummies(s, prefix='%s_' % col)
    one_hot_df[col] = s

    print("Adding One Hot values for %s (the column has %d unique values)" % (col, len(s)))
    pre_len = len(df)

    # Merge the one hot columns back onto the original rows (left join on the category value)
    df = df.merge(one_hot_df, on=[col], how="left")
    assert len(df) == pre_len
    print(df.shape)

Roei Bahumi