
I am probably making a very simple (and stupid) mistake here, but I cannot figure it out. I am playing with some data from Kaggle (Digit Recognizer) and trying to use SVM with the caret package to do some classification. If I plug the label values into train() as numerics, caret defaults to regression and performance is quite poor. So next I converted the labels to a factor with factor() and tried to run SVM classification. Here is some code where I generate some dummy data and then plug it into caret:

library(caret)
library(doMC)
registerDoMC(cores = 4)

ytrain <- factor(sample(0:9, 1000, replace=TRUE))
xtrain <- matrix(runif(252 * 1000, 0, 255), 1000, 252)

preProcValues <- preProcess(xtrain, method = c("center", "scale"))
transformerdxtrain <- predict(preProcValues, xtrain)

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
svmFit <- train(transformerdxtrain[1:10,], ytrain[1:10], method = "svmradial")

I get this error:

Error in kernelMult(kernelf(object), newdata, xmatrix(object)[[p]], coef(object)[[p]]) : 
  dims [product 20] do not match the length of object [0]
In addition: Warning messages:
1: In train.default(transformerdxtrain[1:10, ], ytrain[1:10], method = "svmradial") :
  At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1, X2, X3, X4, X5, X6, X7, X8, X9
2: In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method,  :
  There were missing values in resampled performance measures.

Can somebody tell me what I am doing wrong? Thank you!

mchangun
  • The error message is pretty self explanatory, isn't it? Call your factor levels something other than 0, 1,...9. – joran Dec 17 '12 at 14:53
  • @joran the warning message, isn't it? – agstudy Dec 17 '12 at 14:54
  • @agstudy Yes, thank you. That's certainly an embarrassing warning (oops!, I mean error!) on my part! :) – joran Dec 17 '12 at 15:02
  • @joran I tried using `labels = letters[1:10]` and I get a different error. "Error in .local(object, ...) : test vector does not match model ! In addition: Warning message: In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, : There were missing values in resampled performance measures." – mchangun Dec 17 '12 at 15:03
  • @mchangun it is better to update your question than doing it in a comment. – agstudy Dec 17 '12 at 15:06
  • +1 for Kaggle questions, love seeing people play with their data out in the open. – Brandon Bertelsen Dec 17 '12 at 15:37
  • This may be just a toy example, but resampling from only 10 cases when you have 10 classes seems like trouble. And, in fact, if I reduce it to two classes, it runs fine. Adding labels where ytrain is defined also runs fine for me. Keeping 10 cases and classes and changing to another method of classifier (rpart, cforest) also works. So my guess is that train can't combine the output of whatever svm function in kernlab is getting run if the different outputs have different numbers of classes. This is just a guess though. – MattBagg Dec 17 '12 at 18:01
  • @MattBagg That fixed it. Want to add it as an answer so I can accept it? Thank you! – mchangun Dec 18 '12 at 15:49

2 Answers


You have 10 different classes, yet you are only passing 10 cases to train(). This means that when you resample, individual folds will frequently not contain all 10 classes, and train() has difficulty combining the results of these varying-class SVMs.

You can fix this by some combination of increasing the number of cases, decreasing the number of classes, or even using a different classifier.
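A quick sketch of that fix against the question's dummy data. The data prep is base R; the train() call is shown commented out since it needs caret installed and takes a while to run. Note that the caret method name is "svmRadial" (capital R), and the relabelled levels are valid R names, which also silences the class-probability warning from the question:

```r
# Dummy data mirroring the question, but with all 1000 rows kept and
# factor levels that are syntactically valid R variable names
set.seed(42)
ytrain <- factor(sample(0:9, 1000, replace = TRUE),
                 labels = paste0("digit", 0:9))
xtrain <- matrix(runif(252 * 1000, 0, 255), 1000, 252)

# Every level is already a valid name, so no X0, X1, ... renaming occurs
all(levels(ytrain) == make.names(levels(ytrain)))

# Then (commented out here; requires the caret package):
# library(caret)
# svmFit <- train(xtrain, ytrain, method = "svmRadial",
#                 preProcess = c("center", "scale"),
#                 trControl = trainControl(method = "repeatedcv",
#                                          number = 10, repeats = 10))
```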

MattBagg

I found it challenging to use caret with the digit-recognition use case. I think part of the problem is that the label data is numeric: when caret tries to create variable names from it, they start with a digit, which is not valid as an R variable name.

In my case, I got around it by recoding the label digits to words using dplyr. This assumes your training data is in a data frame called "train".

# recode label into label2 (backticks are needed because the
# original values are numbers, not valid argument names)
train$label2 <- dplyr::recode(train$label,
                              `0` = "zero", `1` = "one", `2` = "two",
                              `3` = "three", `4` = "four", `5` = "five",
                              `6` = "six", `7` = "seven", `8` = "eight",
                              `9` = "nine")

# rearrange the columns so the new label2 sits alongside the original label
train <- train[, c(1, 786, 2:785)]
head(train)

# replace label with the factorized version of the recoded label2
train$label <- factor(train$label2)

# drop label2 since it was only a temporary variable
train$label2 <- NULL

# view the result
head(train)
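For what it's worth, the same relabelling can be done in one step with base factor() and no temporary column. A sketch, where `train` is a hypothetical stand-in with just a label column rather than the full Kaggle frame:

```r
# Hypothetical stand-in for the Kaggle training data frame
train <- data.frame(label = sample(0:9, 20, replace = TRUE))

digit_words <- c("zero", "one", "two", "three", "four",
                 "five", "six", "seven", "eight", "nine")

# Map 0..9 directly onto word labels; the result is already a factor
# with valid R names as levels
train$label <- factor(train$label, levels = 0:9, labels = digit_words)
```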

Kevin