I'm (extremely) new to MLR3 and am using it to model flight delays. I have some numerical variables, like Z, and some categorical variables, like X. Say I want a very simple model that predicts delays from both X and Z. In theory, we would usually encode the levels of X as dummy variables and then fit a linear regression. It looks like MLR3 does this itself, though: when I create a task and train the learner, the fitted model has coefficients for the individual factor levels, i.e. it treats them as separate dummy variables.
However, I see that many other programmers still one-hot encode their categorical variables into dummies first. So my question is: is one-hot encoding necessary, or does MLR3 do it for you?
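To make concrete what I mean by "encoding first", here is a minimal sketch of the explicit approach as I understand it. It assumes the `po("encode")` operator from mlr3pipelines with `method = "one-hot"`; the data frame `toy_df` and the task id are made up for illustration and only mirror the columns of my real data. I would expect it to add one column per level of X (something like X.M, X.T, X.W), but I have not verified this against every package version.

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical toy data with the same columns as my real data
toy_df <- data.frame(
  Y = c(-3, 5, 10, 4, -13, 2, 4),
  X = factor(c("M", "W", "T", "T", "M", "M", "T")),
  Z = c(7.5, 9.2, 3.1, 2.2, 10.1, 1.7, 4.5)
)

# Wrap the data in a regression task with Y as the target
toy_task <- as_task_regr(toy_df, target = "Y", id = "toy_delays")

# Explicitly one-hot encode the factor column before any learner sees it
enc <- po("encode", method = "one-hot")
encoded_task <- enc$train(list(toy_task))[[1]]

# Print the feature names after encoding
print(encoded_task$feature_names)
```

If this is right, the explicit step gives one indicator column per level, whereas the coefficients I see from the linear model leave one level out as the baseline.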
Edit: below is a sample of my data. My predictor variables are X (categorical) and Z (numerical). Y is the dependent variable and is numerical.
Y X Z
-3 M 7.5
5 W 9.2
10 T 3.1
4 T 2.2
-13 M 10.1
2 M 1.7
4 T 4.5
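This is the code I use:
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
# df2 is my data frame with columns Y, X (factor) and Z
task <- TaskRegr$new('apples', backend = df2, target = 'Y')
# Assumed definition (not shown in my original snippet): a linear model wrapped as a graph learner
glrn_lm <- GraphLearner$new(po('learner', lrn('regr.lm')))
set.seed(38)
# Hold out roughly 1% of the rows for testing
train_set <- sample(task$nrow, 0.99 * task$nrow)
test_set <- setdiff(seq_len(task$nrow), train_set)
glrn_lm$train(task, row_ids = train_set)
glrn_lm$predict(task, row_ids = test_set)$score()
# Fit the same formula with plain lm() to inspect the coefficients
summary(lm(formula = task$formula(), data = task$data()))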
The output of the last line looks something like this:
Call:
lm(formula = task$formula(), data = task$data())

Residuals:
    Min      1Q  Median      3Q     Max
 -39.62   -8.71   -4.77    0.27  537.12

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.888e+00  3.233e+00   1.512 0.130542
XT           4.564e-03  3.776e-04  12.087  < 2e-16 ***
XW           4.564e-03  3.776e-04  12.087  < 2e-16 ***
Z           -4.259e+00  6.437e-01  -6.616 3.78e-11 ***
(The numbers up here are all way off - please don't mind that)
So as you can see, it derives two new variables called XT and XW, denoting level T of X and level W of X. I assume that, as in dummy coding, M is the reference level here, which is why there is no XM coefficient. So, like I said earlier, regr_lm seems to already be doing the dummy coding for us. Is that really the case?
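To check my reading of the output outside of MLR3, here is a minimal base-R sketch using the seven sample rows from above. It only uses `model.matrix()` and `lm()`; the name `sample_df` is made up for illustration. It shows the design matrix that R builds from the formula, which is where the XT and XW columns come from.

```r
# The seven sample rows from the question, with X as a factor
sample_df <- data.frame(
  Y = c(-3, 5, 10, 4, -13, 2, 4),
  X = factor(c("M", "W", "T", "T", "M", "M", "T")),
  Z = c(7.5, 9.2, 3.1, 2.2, 10.1, 1.7, 4.5)
)

# Design matrix R builds for the formula: the factor X is expanded
# into indicator columns, with the first level (M) left out as baseline
design <- model.matrix(Y ~ X + Z, data = sample_df)
print(colnames(design))

# Fitting lm() directly on the factor gives coefficients with the same names
fit <- lm(Y ~ X + Z, data = sample_df)
print(names(coef(fit)))
```

On this data the design matrix has the columns (Intercept), XT, XW and Z, and rows where X is M have zeros in both XT and XW. So at least in plain `lm()`, the dummy coding happens when the formula is expanded, without any manual encoding step.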