I am using the caret function train() to fit a support vector machine model. My dataset, Matrix, has a large number of rows (255099) and few columns (8, including the response/target variable). The target variable is a factor with 10 levels. My issue is the time it takes to train the model. The dataset Matrix and the code I used for the model are included below. I also used parallel to speed things up, but it is not working.
#Libraries
library(rsample)
library(caret)
library(dplyr)
library(doParallel) # provides registerDoParallel(); also attaches parallel for makePSOCKcluster()
#Original dataframe
set.seed(1854)
Matrix <- data.frame(
  Var1 = rnorm(255099, mean = 20, sd = 1),
  Var2 = rnorm(255099, mean = 30, sd = 10),
  Var3 = rnorm(255099, mean = 15, sd = 11),
  Var4 = rnorm(255099, mean = 50, sd = 12),
  Var5 = rnorm(255099, mean = 100, sd = 20),
  Var6 = rnorm(255099, mean = 180, sd = 30),
  Var7 = rnorm(255099, mean = 200, sd = 50),
  Target = sample(1:10, 255099,
                  prob = c(0.15, 0.1, 0.1,
                           0.15, 0.1, 0.14,
                           0.10, 0.05, 0.06,
                           0.05),
                  replace = TRUE)
)
#Format target variable
Matrix %>% mutate(Target=as.factor(Target)) -> Matrix
# Create training and test sets
set.seed(1854)
strat <- initial_split(Matrix, prop = 0.7, strata = 'Target')
traindf <- training(strat)
testdf <- testing(strat)
#SVM model
#Enable parallel computing
cl <- makePSOCKcluster(7)
registerDoParallel(cl)
#SVM radial basis kernel
set.seed(1854) # for reproducibility
svmmod <- caret::train(
  Target ~ .,
  data = traindf,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)
#Stop parallel
stopCluster(cl)
Even using parallel, the train() call in the code above did not finish. My computer (Windows, Intel Core i3, 6 GB RAM) was not able to complete the training in 3 days. I left it running the whole time, but the model was never trained, so I stopped it.
Maybe I am doing something wrong that makes train() so slow. I would like to know if there is any way to speed up the training I defined. I also do not understand why it takes so long when there are only 8 variables.
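For reference, here is a rough count of how much work my train() call requests. This is only a sketch in base R under two assumptions of mine: that caret fits one model per fold for each tuning candidate and then refits once on the full training set, and that tuneLength = 10 gives 10 candidates for svmRadial. The variable names are just for illustration, and the exact row counts from the stratified split may differ slightly.

```r
# Rough count of the model fits requested by the train() call above.
# Assumption: with method = "cv", caret fits one model per fold for each
# tuning candidate, then refits the selected candidate once on all training rows.
n_rows       <- 255099  # rows in Matrix
train_prop   <- 0.7     # prop used in initial_split()
n_folds      <- 10      # number = 10 in trainControl()
n_candidates <- 10      # tuneLength = 10 (assumed: 10 candidate settings)

# Approximate size of the training set after the 70/30 split
n_train <- round(n_rows * train_prop)

# Each cross-validation fit uses all folds except one
rows_per_cv_fit <- round(n_train * (n_folds - 1) / n_folds)

# Resampling fits plus one final fit on the whole training set
n_fits <- n_folds * n_candidates + 1

cat("Approximate training rows:", n_train, "\n")
cat("Approximate rows per CV fit:", rows_per_cv_fit, "\n")
cat("Total SVM fits requested:", n_fits, "\n")
```

If these assumptions hold, the number of columns is not what drives the cost: the call asks for about a hundred separate SVM fits, each on well over 100000 rows.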
Could you please help me solve this issue? I have looked for solutions to this problem without success. Any suggestion on how to improve my training approach is welcome. Some solutions mention that h2o can be used, but I do not know how to set up my SVM model in that framework.
Many thanks for your help.