How to improve computing time performance when using caret to train a model over large datasets

Question

I am using the caret function train() to fit a support vector machine model. My dataset, Matrix, has many rows (255099) and few columns (8, including the response/target variable). The target variable is a factor with 10 groups. My problem is the speed of training. The dataset and the code I used for the model are included below. I have also tried parallel processing to make it faster, but it did not help.

#Libraries
library(rsample)
library(caret)
library(dplyr)
library(doParallel) # needed for makePSOCKcluster() and registerDoParallel()
#Original dataframe
set.seed(1854)
Matrix <- data.frame(Var1=rnorm(255099,mean = 20,sd=1),
                     Var2=rnorm(255099,mean = 30,sd=10),
                     Var3=rnorm(255099,mean = 15,sd=11),
                     Var4=rnorm(255099,mean = 50,sd=12),
                     Var5=rnorm(255099,mean = 100,sd=20),
                     Var6=rnorm(255099,mean = 180,sd=30),
                     Var7=rnorm(255099,mean = 200,sd=50),
                     Target=sample(1:10,255099,prob = c(0.15,0.1,0.1,
                                                           0.15,0.1,0.14,
                                                           0.10,0.05,0.06,
                                                           0.05),replace = TRUE))
#Format target variable
Matrix %>% mutate(Target=as.factor(Target)) -> Matrix
# Create training and test sets
set.seed(1854)
strat <- initial_split(Matrix, prop = 0.7,
                             strata = 'Target')
traindf <- training(strat)
testdf <- testing(strat)
#SVM model
#Enable parallel computing
cl <- makePSOCKcluster(7)
registerDoParallel(cl)
#SVM radial basis kernel
set.seed(1854) # for reproducibility
svmmod <- caret::train(
  Target ~ .,
  data = traindf,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)
#Stop parallel
stopCluster(cl)

Even with parallel processing, the train() call above did not finish. My computer (Windows, Intel Core i3, 6 GB RAM) ran for 3 days without completing the training, so I stopped it.

Maybe I am doing something wrong that makes train() very slow. I would like to know whether there is any way to speed up the training I defined. I also do not understand why it takes so long when there are only 8 variables.

Could you please help me solve this issue? I have looked for solutions without success. Any suggestion on how to improve my training method is welcome. Some solutions mention that h2o can be used, but I do not know how to set up my SVM scheme in that framework.

Many thanks for your help.

I actually ran into the issue and the issue isn't with caret. you can call kernlab::rsvm and test it on your dataset. you can see it's quite slow. https://stackoverflow.com/questions/30385347/r-caret-unusually-slow-when-tuning-svm-with-linear-kernel — StupidWolf, Jul 06 '20 at 23:02
@StupidWolf many thanks! Could you please teach me how can I include `kernlab::rsvm` inside `train()` configuration? — Duck, Jul 06 '20 at 23:05
Sorry @Duck, I may have confused you. If you run `getModelInfo("svmRadial")$svmRadial$fit`, you can see the underlying function used to fit your model; in this case you will see that it is `kernlab::ksvm` — StupidWolf, Jul 06 '20 at 23:08
If you directly use kernlab::ksvm(x=..,y=..) on 50% of your training data, you will see that it takes quite a while.. so that explains why the parallelization doesn't help — StupidWolf, Jul 06 '20 at 23:08
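A minimal sketch of the check described in the two comments above, assuming `caret` and `kernlab` are installed. The synthetic data frame `toy` and the subset sizes are illustrative assumptions, chosen small so the script finishes quickly; they are not values from the thread. It first prints the fitting function behind `method = "svmRadial"`, then times `kernlab::ksvm` directly on growing subsets.

```r
library(caret)
library(kernlab)

# Look up the function caret uses to fit method = "svmRadial";
# printing it shows that the body calls kernlab::ksvm
fit_fun <- getModelInfo("svmRadial")$svmRadial$fit
print(fit_fun)

# Small synthetic stand-in for the question's data (assumption):
# 7 numeric predictors and a 10-level factor target
set.seed(1854)
n <- 4000
toy <- data.frame(matrix(rnorm(n * 7), ncol = 7))
toy$Target <- factor(sample(1:10, n, replace = TRUE))

# Time a single RBF-kernel fit on subsets of increasing size
sizes <- c(500, 1000, 2000, 4000)
elapsed <- sapply(sizes, function(m) {
  idx <- seq_len(m)
  system.time(
    ksvm(x = as.matrix(toy[idx, 1:7]), y = toy$Target[idx], kernel = "rbfdot")
  )[["elapsed"]]
})

# One row per subset size; compare how the time grows as rows double
timings <- data.frame(rows = sizes, seconds = elapsed)
print(timings)
```

If the time grows faster than the row count, a single fit on the full training set will be expensive no matter how many workers are registered. The parallel backend only spreads separate fits across workers; it does not speed up any one `ksvm` call. With `method = "cv", number = 10` and `tuneLength = 10`, `train()` requests 100 resample fits (10 folds for each of 10 cost values) plus a final fit, each on roughly 160,000 rows.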
@StupidWolf Thanks, and do you know how can I change the `svm` engine? It looks like `kernlab::rsvm` is the main issue of my code. Your help is gold for me. Again Thanks. — Duck, Jul 06 '20 at 23:12
@StupidWolf And since you have helped me a lot, if you post a way to include a new method for SVM in my code, I will accept your answer immediately! Infinite thanks. — Duck, Jul 06 '20 at 23:17
Lol @Duck.. I post it as a comment because I am really not sure how to speed it up. Just the rbfdot kernel takes quite a while.. the underlying c-code is beyond me. This seems to be something possible to try https://cran.r-project.org/web/packages/liquidSVM/vignettes/demo.html — StupidWolf, Jul 06 '20 at 23:23
Personally (not the answer you want to hear I guess), I use scikit-learn usually.. for big datasets... — StupidWolf, Jul 06 '20 at 23:24
@StupidWolf Oh I understand, I need to do this with `R`. I should look for another option. But thanks for all your help. — Duck, Jul 06 '20 at 23:29
[LiquidSVM](https://cran.r-project.org/web/packages/liquidSVM/index.html) is a relatively fast SVM implementation. I don't think it's implemented in caret, but it has its own hyperparameter tuning interface. Are you limited to SVM? In my experience a simple random forest is always at least as good as an SVM - [ranger](https://cran.r-project.org/web/packages/ranger/) is a good and fast implementation. — missuse, Jul 07 '20 at 21:02
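For the random forest route mentioned above, a minimal sketch with `ranger` called directly, assuming the package is installed. The synthetic data frame `toy`, the number of trees, and the thread count are illustrative assumptions, not settings recommended in the thread.

```r
library(ranger)

# Small synthetic stand-in for the question's data (assumption)
set.seed(1854)
n <- 5000
toy <- data.frame(matrix(rnorm(n * 7), ncol = 7))
toy$Target <- factor(sample(1:10, n, replace = TRUE))

# Fit a classification forest; num.threads controls ranger's own
# multithreading, so no separate cluster needs to be registered
rf <- ranger(Target ~ ., data = toy, num.trees = 200, num.threads = 2)

# Out-of-bag misclassification rate, available without a separate CV loop
print(rf$prediction.error)

# Predicted classes for new rows
pred <- predict(rf, data = toy[1:5, ])$predictions
print(pred)
```

caret also exposes this engine as `method = "ranger"`, so the same `train()` workflow from the question could be reused for the comparison against SVM.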
@missuse Many thanks for your answer. Yes, I am on a project whose goal is to compare SVM with other ML models such as RF. I will try your suggestion. I looked at the `caret` implementation because of the way it handles tuning. Is there any way you could post a `LiquidSVM` training scheme similar to mine? I will accept it as the answer because, as you stated, that method is faster. — Duck, Jul 07 '20 at 21:08
I think it would be best for your purpose to implement LiquidSVM as a custom model in caret. If I get inspiration for SO and if my schedule allows it I can attempt it, but I suggest you start yourself, it will definitely enhance your R knowledge. Recommended reading prior to doing it: https://topepo.github.io/caret/using-your-own-model-in-train.html, mlr has it implemented you might want to check it: https://rdrr.io/cran/liquidSVM/man/mlr-liquidSVM.html. Also check: https://cran.r-project.org/web/packages/liquidSVM/vignettes/demo.html — missuse, Jul 07 '20 at 21:20
You can also use mlr instead of caret which already supports liquidSVM but then you will have to learn mlr which is deprecated and superseded by the awesome mlr3. Or you can opt to use mlr3 and implement liquidSVM in it. In the long run the last proposition is probably the wisest if you plan to use R for ML. — missuse, Jul 07 '20 at 21:21
@missuse And pretty nice if you have some time to implement LiquidSVM in `caret` in order to help. You are great! — Duck, Jul 07 '20 at 21:31
@missuse And as I like `R` I will try myself to implement LiquidSVM as custom model in `caret` :) — Duck, Jul 07 '20 at 21:35
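The custom-model route suggested in the comments can be sketched as follows. This is only a skeleton of the list structure that `train()` accepts for user-defined models. It uses `e1071::svm` as a stand-in engine (an assumption made here, not something proposed in the thread) because the liquidSVM function signatures are not shown in this discussion and should be checked against that package's documentation. The object name `svmStandIn`, the tuning grid, and the toy data are all hypothetical.

```r
library(caret)
library(e1071)  # stand-in engine (assumption); not liquidSVM

# Hypothetical custom model object for train(); the fit/predict bodies
# are where a call to a different SVM engine would be substituted
svmStandIn <- list(
  label   = "RBF SVM (stand-in engine)",
  library = "e1071",
  type    = "Classification",
  # Tuning parameters that train() will vary
  parameters = data.frame(
    parameter = c("cost", "gamma"),
    class     = c("numeric", "numeric"),
    label     = c("Cost", "Gamma")
  ),
  # Candidate values: `len` cost values, one fixed gamma (illustrative choice)
  grid = function(x, y, len = NULL, search = "grid") {
    expand.grid(cost = 2^(seq_len(len) - 2), gamma = 1 / ncol(x))
  },
  # Fit one model for one row of the tuning grid
  fit = function(x, y, wts, param, lev, last, weights, classProbs, ...) {
    e1071::svm(x = as.matrix(x), y = y, kernel = "radial",
               cost = param$cost, gamma = param$gamma,
               probability = classProbs, ...)
  },
  # Class predictions
  predict = function(modelFit, newdata, preProc = NULL, submodels = NULL) {
    predict(modelFit, as.matrix(newdata))
  },
  # Class probabilities (only used when classProbs = TRUE)
  prob = function(modelFit, newdata, preProc = NULL, submodels = NULL) {
    out <- predict(modelFit, as.matrix(newdata), probability = TRUE)
    attr(out, "probabilities")
  },
  # Order candidates from simplest to most complex
  sort = function(x) x[order(x$cost), ]
)

# Small synthetic data so the example runs quickly (assumption)
set.seed(1854)
n <- 600
toy <- data.frame(matrix(rnorm(n * 7), ncol = 7))
toy$Target <- factor(sample(paste0("g", 1:10), n, replace = TRUE))

fit <- train(
  Target ~ .,
  data = toy,
  method = svmStandIn,            # pass the list instead of a method name
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 3),
  tuneLength = 3
)
print(fit$results)
```

Wrapping a different engine this way keeps caret's resampling and tuning workflow, but the total run time is still dominated by how long a single fit takes on the full data.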

