
Lately I have been advised to switch my machine learning framework to mlr3, but I am finding the transition somewhat more difficult than I expected. In my current project I am dealing with highly imbalanced data, which I would like to balance before training my model. I found this tutorial, which explains how to deal with imbalance via pipelines and a graph learner:

https://mlr3gallery.mlr-org.com/posts/2020-03-30-imbalanced-data/

I am afraid that this approach will also perform class balancing when predicting on new data. Why would I want to do that and reduce my testing sample?

So the two questions that arise are:

  1. Am I correct not to balance classes in testing data?
  2. If so, is there a way of doing this in mlr3?

Of course, I could just subset the training data manually and deal with the imbalance myself, but that's just not fun anymore! :)

Anyway, thanks for any answers,
Cheers!

Radbys

1 Answer


To answer your questions:

I am afraid that this approach will also perform class balancing with new data predicting.

This is not correct; where did you get this idea?

Am I correct not to balance classes in testing data?

Class balancing usually works by adding or removing rows (or adjusting weights). None of those steps should be applied during prediction, as we want exactly one predicted value for each row in the data. Weights, on the other hand, usually have no effect during the prediction phase anyway. So yes, your assumption is correct.
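One way to see this directly is to call the `PipeOp`'s `$train()` and `$predict()` methods on the same task and compare row counts. A minimal sketch, assuming the `mlr3` and `mlr3pipelines` packages and the built-in `german_credit` task; the parameter values are purely illustrative:

```r
library(mlr3)
library(mlr3pipelines)

task <- tsk("german_credit")  # imbalanced binary task shipped with mlr3

# Oversample the minority class to (at most) twice its original size
po_bal <- po("classbalancing", adjust = "minor", reference = "minor", ratio = 2)

# Training modifies the task: minority-class rows are duplicated
task_train <- po_bal$train(list(task))[[1]]
task_train$nrow  # larger than task$nrow

# Prediction passes the task through unchanged: one row in, one row out
task_pred <- po_bal$predict(list(task))[[1]]
task_pred$nrow   # equal to task$nrow
```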

If so, is there a way of doing this in mlr3?

Just use the PipeOp as described in the blog post. During training, it will do the specified over- or undersampling, while it does nothing during prediction.
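For completeness, here is a sketch of how this typically gets wired up into a graph learner. It assumes `classif.rpart` as a stand-in learner and the same illustrative `classbalancing` parameters as above; see the blog post for the exact setup used there:

```r
library(mlr3)
library(mlr3pipelines)

# Balance classes during training, then fit a decision tree
graph <- po("classbalancing", adjust = "minor", reference = "minor", ratio = 2) %>>%
  lrn("classif.rpart", predict_type = "prob")

glrn <- as_learner(graph)  # wrap the graph so it behaves like a regular learner

task  <- tsk("german_credit")
split <- partition(task)                          # simple train/test split
glrn$train(task, row_ids = split$train)           # balancing is applied here
pred <- glrn$predict(task, row_ids = split$test)  # test rows are left untouched
pred$confusion
```

During resampling or tuning the same thing happens fold by fold: each training split is balanced, each test split is predicted as-is.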

Cheers,

pfistfl
  • That's just my assumption. If you look at the chart from the mlr3 book (https://mlr3book.mlr-org.com/pipe-modeling.html), it looks like every part of the graph learner is also applied to new data. Is class balancing an exception, or am I understanding this incorrectly? – Radbys Feb 16 '21 at 19:11
  • 1
    `PipeOp`s are implemented with the correct steps during `train` and `test` in mind. So in this case, your assumption is incorrect. – pfistfl Feb 17 '21 at 12:13
  • 1
    See [here](https://github.com/mlr-org/mlr3pipelines/blob/507b0ed264cd267678b6b7aecdfab59141829b9e/R/PipeOpClassBalancing.R#L164) for a reference. During `predict` the `PipeOp` just passes on it's input. – pfistfl Feb 17 '21 at 15:40
  • Awesome, thanks for that detailed explanation! Really appreciated. – Radbys Feb 18 '21 at 13:24
  • The documentation says "classweights adds a class weight column. Sample weights are added to each sample according to the target class. Only binary classification tasks are supported". How can we use classweights with multiclass outcome in mlr3? – skan Jul 22 '23 at 15:09