
I'm (extremely) new to mlr3 and am using it to model flight delays. I have some numerical variables, like Z, and some categorical variables, like X. Let's say I want to fit a very simple model predicting delays from both X and Z. From a theoretical perspective, we would usually encode the levels of X into dummy variables and then fit a linear regression. mlr3 seems to be doing this itself, though: when I create a task and run the learner, I can see that it has created coefficients for the different factor levels, i.e. it treats them as separate dummy variables.

However, I see that many other programmers still one-hot encode their categorical variables into dummies first. So my question is: is one-hot encoding necessary, or does mlr3 do it for you?

edit: Below is an example of my data. My predictor variables are X (categorical) and Z (numerical). Y is the dependent variable and is numerical.

  Y   X     Z
 -3   M   7.5
  5   W   9.2
 10   T   3.1
  4   T   2.2
-13   M  10.1
  2   M   1.7
  4   T   4.5

This is the code I use:

library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# Regression task with Y as the target
task <- TaskRegr$new('apples', backend = df2, target = 'Y')

# Linear regression learner (assumed to be lrn('regr.lm'),
# consistent with the lm-style output below)
glrn_lm <- lrn('regr.lm')

# 99/1 train/test split
set.seed(38)
train_set <- sample(task$nrow, 0.99 * task$nrow)
test_set <- setdiff(seq_len(task$nrow), train_set)

glrn_lm$train(task, row_ids = train_set)
glrn_lm$predict(task, row_ids = test_set)$score()

# Fit the equivalent model with base R's lm() to inspect the coefficients
summary(lm(formula = task$formula(), data = task$data()))

And the output of that last summary() line will be something like:

Call:
lm(formula = task$formula(), data = task$data())

Residuals:
   Min     1Q Median     3Q    Max 
-39.62  -8.71  -4.77   0.27 537.12 

Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                4.888e+00  3.233e+00   1.512 0.130542    
XT                                         4.564e-03  3.776e-04  12.087  < 2e-16 ***
XW                                         4.564e-03  3.776e-04  12.087  < 2e-16 ***
Z                                         -4.259e+00  6.437e-01  -6.616 3.78e-11 ***
 

(The numbers above are all way off - please don't mind that.)

So, as you can see, it derives two new variables, XT and XW, denoting the levels T and W of the factor X. I assume that, as in dummy coding, M is the reference level here. So, like I said earlier, regr.lm seems to already be doing the dummy coding for us. Is that really the case?
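
For reference, this matches base R's treatment coding: lm() builds its design matrix with model.matrix(), which expands a factor into dummy columns using the first level as the reference. A minimal sketch with the example data above (df2 is rebuilt here so the snippet runs on its own):

# Rebuild the example data from the table above
df2 <- data.frame(
  Y = c(-3, 5, 10, 4, -13, 2, 4),
  X = factor(c("M", "W", "T", "T", "M", "M", "T")),
  Z = c(7.5, 9.2, 3.1, 2.2, 10.1, 1.7, 4.5)
)

# The design matrix lm() actually fits: X becomes dummies XT and XW,
# with M (the alphabetically first level) as the reference
model.matrix(~ X + Z, data = df2)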

  • Please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), preferably using the [reprex package](https://github.com/tidyverse/reprex). `mlr3` does not automatically encode factor variables. The upstream package of the `Learner` might do this. – be-marc Feb 04 '22 at 10:28

1 Answer


In general, mlr3 doesn't automatically encode your categorical features for you. Whether categorical features work out of the box depends on the learner you're using -- some, like the linear regression you're using, can work with categorical features directly, while others can't (and if you try to use those, you'll get an error message saying so).
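
You can check whether a particular learner handles factors by inspecting its feature_types field (a minimal sketch; any learner key from mlr3learners can be substituted):

library(mlr3)
library(mlr3learners)

# Each learner declares the feature types it accepts;
# "factor" in this vector means categorical columns work out of the box
lrn("regr.lm")$feature_types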

There's generally no downside to one-hot encoding your categorical features, so if you want to try many different learners, I'd recommend doing that so that you don't have to worry about whether a particular learner requires it.
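
One way to set this up in mlr3 is to chain an encoding PipeOp in front of the learner with mlr3pipelines (a minimal sketch, assuming the task object from your question):

library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# One-hot encode all factor columns, then pass the result to the learner
graph <- po("encode", method = "one-hot") %>>% lrn("regr.lm")

# Wrap the pipeline so it can be trained and used like a single learner
glrn <- as_learner(graph)
glrn$train(task)
glrn$predict(task)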

Lars Kotthoff
  • Yes, what you're saying makes sense. Right now I'm using regr.lm, and maybe that's why it works, but it probably won't with ridge regression and the like. Can you perhaps tell me how to do one-hot encoding, though? I have no idea. – Academic005 Feb 07 '22 at 01:47
  • There's a great answer for this and other issues that may arise here: https://stackoverflow.com/questions/60620158/using-mlr3-pipelines-to-impute-data-and-encode-factor-columns-in-graphlearner – Lars Kotthoff Feb 07 '22 at 16:10