
I have just started using mlr3 and am still very unfamiliar with the syntax. I have two questions:

  1. How can I access the coefficients from a trained logistic regression in mlr3?
  2. I am dealing with an extremely imbalanced dataset (98% vs. 2%) with over 2 million rows. I tried the SMOTE method, but it is very slow here, even though the same thing finishes very quickly in Python, so is there a mistake in my code? Here is my code:
task = TaskClassif$new("pcs", backend = pcs, target = "navigator", positive = "1")
table(task$truth())

po_over = po("classbalancing", id = "oversample", adjust = "minor", reference = "minor", shuffle = FALSE, ratio = 16)
table(po_over$train(list(task))$output$truth())

learner = mlr_learners$get("classif.rpart")
learner$predict_type = "prob"

learner = po_over %>>% learner

resampling = rsmp("holdout", ratio = 0.8)

rr = resample(task, learner, resampling, store_models = TRUE)

res <- rr$prediction()
auto1 <- autoplot(res)
auto2 <- autoplot(res,type='roc')

rr$score(msr("classif.acc"))$classif.acc %>% print()

And for SMOTE:

gr_smote =
  po("colapply", id = "int_to_num",
    applicator = as.numeric, affect_columns = selector_type("integer")) %>>%
  po("smote", dup_size = 15) %>>%
  po("colapply", id = "num_to_int",
    applicator = function(x) as.integer(round(x, 0L)), affect_columns = selector_type("numeric"))
  • I'm guessing that you have not offered a [MCVE], but since you didn't include any `library` calls I cannot really be sure. Did you imagine that we would be able to run this code? And did you search SO for matches to "[r] coefficient mlr"? If you want a more theoretical or strategic advice forum, then perhaps go to SE::Data Science. Here we are all about concrete coding examples. – IRTFM Mar 14 '21 at 22:35
  • I am no expert on this, but "classif.rpart" is CART (a single decision tree), not logistic regression. Specify "classif.log_reg" if you want to use logistic regression; it uses `glm` and will give you a model with parameters. Here, you can see the model by doing `learner$model` after training it. It will show you a series of decisions. It is non-parametric: there is no formula or coefficients. – Vons Mar 14 '21 at 22:44
  • @IRTFM thanks for the advice, I did search that way and found ```getLearnerModel()```, but I'm not sure if it can be used in the mlr3 package. Sorry I didn't offer a minimal example; SMOTE works fast enough for a small dataset (for example 10000*8), but for a million rows with 8 features it runs slowly. I tried the same dataset in Python using the ```imblearn``` package, and it takes about 3s. I just want to know if there is some setting to speed up the SMOTE method in R. – Carl Mar 14 '21 at 22:46
  • @Stacker, thanks for the help, I made a mistake in that place, but `learner$model` returns NULL :( – Carl Mar 14 '21 at 22:54

1 Answer


Here's what I gathered for your question #1:

  1. Create a data set with approximately 98% 1's and 2% 0's

  2. Make training and testing tasks

  3. (1) Create the class-balancing PipeOp

    (2) Create the learner as a GraphLearner; the way it is done in your original code won't work with a PipeOp

  4. Train the learner on the training set

  5. Test on the test set

library(mlr3)
library(dplyr)
library(mlr3pipelines)
set.seed(10)

# 1. Simulate a dataset with roughly 98% 1's and 2% 0's
pcs = data.frame(a = runif(1000), b = runif(1000))
pcs = pcs %>%
  mutate(c = 2 * a + 3 * b, d = ifelse(c > .6, 1, 0), navigator = factor(d)) %>%
  select(-c, -d)

# 2. Create the task and split it into training and test sets
task = TaskClassif$new("pcs", backend = pcs, target = "navigator", positive = "1")
train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)

task_train <- task$clone()$filter(train_set)
task_test  <- task$clone()$filter(test_set)

# 3. Create the class-balancing PipeOp and wrap it together with the
#    learner in a GraphLearner
po_over1 = po("classbalancing")
po_over1$param_set$values = list(ratio = 16, reference = "minor", adjust = "minor", shuffle = FALSE)

learner = GraphLearner$new(
  po_over1 %>>%
    po("learner", lrn("classif.rpart",
                      predict_type = "prob"))
)

# 4. Train on the training set
learner$train(task_train)

# 5. Predict on the test set
pred = learner$predict(task_test)

Output:

learner$model
#' You can see the predicted probability by following the decision tree
#' e.g. say you have a data point a and b
#' first check that b>=.112 or b<.112 (nodes 2 and 3)
#' etc.
1) root 1085 304 1 (0.71981567 0.28018433)  
  2) b>=0.1122314 728  16 1 (0.97802198 0.02197802)  
    4) a>=0.007176245 709   0 1 (1.00000000 0.00000000) *
    5) a< 0.007176245 19   3 0 (0.15789474 0.84210526) *
  3) b< 0.1122314 357  69 0 (0.19327731 0.80672269)  
    6) a>=0.246552 65   0 1 (1.00000000 0.00000000) *
    7) a< 0.246552 292   4 0 (0.01369863 0.98630137) *

# Test predictions
pred$confusion
        truth
response   1   0
       1 195   1
       0   0   4
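
A note on the coefficient part of question #1: classif.rpart fits a decision tree, so there are no coefficients to extract. If you actually want logistic-regression coefficients, a minimal sketch along these lines should work, reusing po_over1 and task_train from above (classif.log_reg comes from the mlr3learners package, which none of the code here loads):

library(mlr3learners)  # assumption: provides lrn("classif.log_reg"), a glm() wrapper

glr = GraphLearner$new(
  po_over1 %>>% po("learner", lrn("classif.log_reg"))
)
glr$train(task_train)

# A trained GraphLearner's $model is a named list keyed by PipeOp id;
# the fitted glm object sits under the learner's id
fit = glr$model$classif.log_reg$model
coef(fit)

Also, when you go through resample() as in the original code, the learner you pass in is cloned and never trained itself, which is why learner$model came back NULL; with store_models = TRUE the trained copies are stored in the result, e.g. rr$learners[[1]]$model.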

This is for question #2 (SMOTE):

gr_smote =
  po("colapply", id = "int_to_num",
     applicator = as.numeric, affect_columns = selector_type("integer")) %>>%
  po("smote", dup_size = 15) %>>%
  po("colapply", id = "num_to_int",
     applicator = function(x) as.integer(round(x, 0L)), affect_columns = selector_type("numeric"))

learner = GraphLearner$new(
  gr_smote %>>% po("learner", lrn("classif.rpart", predict_type = "prob"))
)
learner$train(task_train)
learner$model
1) root 1085 304 1 (0.7198157 0.2801843)  
  2) b>=0.5 391   0 1 (1.0000000 0.0000000) *
  3) b< 0.5 694 304 1 (0.5619597 0.4380403)  
    6) a>=0.5 203   0 1 (1.0000000 0.0000000) *
    7) a< 0.5 491 187 0 (0.3808554 0.6191446) *

pred = learner$predict(task_test)
pred$confusion
        truth
response   1   0
       1 159   0
       0  36   5
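
If SMOTE stays slow on the full two-million-row task, note that SMOTE performs a nearest-neighbour search over the minority class, which gets expensive at that scale. A cheaper alternative is to undersample the majority class with classbalancing instead of synthesizing new minority rows; a sketch of that idea, reusing task_train from above (the ratio value is an assumption you would tune):

# Shrink the majority class down to 4x the minority class size; no
# nearest-neighbour search is involved, so this scales much better
po_under = po("classbalancing", id = "undersample",
              adjust = "major", reference = "minor",
              shuffle = FALSE, ratio = 4)

learner_under = GraphLearner$new(
  po_under %>>% po("learner", lrn("classif.rpart", predict_type = "prob"))
)
learner_under$train(task_train)

Dropping majority rows does discard information, but with roughly 40,000 minority rows in your data there is still plenty left to fit on.
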
  • Amazing, thanks for your help Stacker. But I have been running the SMOTE method for over 30 minutes and it hasn't finished, even though it is very fast in Python; do you know why this is happening? Is there anything I could set up to make SMOTE faster? Thanks! – Carl Mar 15 '21 at 10:56
  • I dunno, it's probably because your dataset is huge? You can try undersampling instead, because 2% of 2 million is still 40,000 minority rows. Try `install.packages("DMwR"); pcs = SMOTE(navigator ~ ., data, perc.over = 10, perc.under = 100); learner = mlr_learners$get("classif.rpart"); learner$predict_type = "prob"; learner$train(task_train)` or even perc.over = 1. – Vons Mar 15 '21 at 15:31
  • It is still very slow, but thanks for your help, appreciated! – Carl Mar 15 '21 at 22:27