
I have a huge dataset, and I am quite new to R, so the only way I can think of implementing 100-fold-CV by myself is through many for's and if's, which makes it extremely inefficient for my huge dataset and might even take several hours to run. I started looking for packages that do this instead and found quite a few topics related to CV on Stack Overflow. I have been trying to use the ones I found, but none of them are working for me, and I would like to know what I am doing wrong here.

For instance, this code from the DAAG package:

cv.lm(data=Training_Points,
      form.lm=formula(t(alpha_cofficient_values) %*% Training_Points),
      m=100, plotit=TRUE)

...gives me the following error:

Error in formula.default(t(alpha_cofficient_values)
%*% Training_Points) : invalid formula

I am trying to do Kernel Ridge Regression, so I have the alpha coefficient values already computed. For getting predictions, I only need to do either t(alpha_cofficient_values) %*% Test_Points or simply crossprod(alpha_cofficient_values, Test_Points), and this will give me all the predictions for unknown values. So I am assuming that in order to test my model, I should do the same thing but for KNOWN values, which means I need to use my Training_Points dataset.
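For concreteness, here is a toy sketch with made-up dimensions (random placeholder matrices, not my real data) just to show the two equivalent ways I would compute the predictions:

# Toy illustration with random placeholder data (not my real matrices):
# alpha has one coefficient per training point
alpha_cofficient_values = matrix(rnorm(9000), nrow = 9000, ncol = 1)
Test_Points             = matrix(rnorm(9000 * 10), nrow = 9000, ncol = 10)

pred1 = t(alpha_cofficient_values) %*% Test_Points
pred2 = crossprod(alpha_cofficient_values, Test_Points)
all.equal(pred1, pred2)   # TRUE: crossprod(A, B) is t(A) %*% B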

My Training_Points data set has 9000 columns and 9000 rows. I could write for's and if's and do 100-fold-CV, each time taking 100 rows as test data and leaving 8900 rows for training, repeating until the whole data set has been used, then take averages and compare with my known values. But isn't there a package that does the same? (and ideally also compares the predicted values with the known values and plots them, if possible) Something along the lines of the rough sketch below, but done efficiently.
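The sketch only shows the fold bookkeeping; the per-fold refit and prediction are left as comments, since that is exactly the part I would like a package to handle (variable names like known_values are just illustrative):

# Rough sketch of the manual k-fold loop (fold bookkeeping only)
n_folds = 100
n       = 9000                                      # nrow(Training_Points)
fold_id = sample(rep(1:n_folds, length.out = n))    # random fold assignment

fold_mse = numeric(n_folds)
for (k in 1:n_folds) {
  test_idx  = which(fold_id == k)
  train_idx = setdiff(seq_len(n), test_idx)
  # ...refit the model on the training rows, predict the held-out rows,
  # and store the fold error, e.g.
  # fold_mse[k] = mean((pred - known_values[test_idx])^2)
}
mean(fold_mse)   # average error over all folds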

Please do excuse me for my elementary question, I am very new to both R and cross-validation, so I might be missing some basic points.

KulltMat
  • I don't quite understand how you already have the coefficients as these will be different for the 100 different models? So do you have a 9000 x 100 matrix of coefficients? – timcdlucas May 14 '16 at 16:30
  • You could use the caret package (I'll add an answer once I've checked how long it takes to run). Having 9k predictor variables does inevitably make it fairly slow. Do you need to do 100 fold cross validation? Would 10 fold be reasonable? – timcdlucas May 14 '16 at 16:31
  • Ah, caret doesn't have kernel ridge regression built in. You would have to add it. – timcdlucas May 14 '16 at 16:32

1 Answer


The CVST package implements fast cross-validation via sequential testing. This method significantly speeds up the computation while preserving full cross-validation capability. Additionally, the package developers also added conventional cross-validation functionality.

I haven't used the package before, but it seems pretty flexible and straightforward to use. Additionally, KRR is readily available as a CVST.learner object through the constructKRRLearner() function. To use the cross-validation functionality, you must first convert your data to a CVST.data object using the constructData(x, y) function, with x the feature data and y the labels. Next, you can use one of the cross-validation functions to optimize over a defined parameter space. You can tweak the settings of both the CV and fastCV methods to your liking.

After the cross-validation spits out the optimal parameters, you can create the model by using the learn function and subsequently predict new labels. I put together an example from the package documentation on CRAN.

library(CVST)

# Construct a CVST.data object with constructData(x, y)
# (noisySinc() below already returns one, so no conversion is needed here)

# Load some toy data..
ns = noisySinc(1000)
# Kernel ridge regression learner
krr = constructKRRLearner()
# Create the parameter space to search over
params = constructParams(kernel="rbfdot", sigma=10^(-3:3),
                         lambda=c(0.05, 0.1, 0.2, 0.3)/getN(ns))

# Run fast cross-validation via sequential testing
opt = fastCV(ns, krr, params, constructCVSTModel())
# OR conventional cross-validation.. much slower!
opt = CV(ns, krr, params, fold=100)

# opt is a list of parameter settings; take the best one
p = opt[[1]]
# Train the model
m = krr$learn(ns, p)
# Predict with the model
nsTest = noisySinc(10000)
pred = krr$predict(m, nsTest)
# Evaluate: mean squared error on the test set
sum((pred - nsTest$y)^2) / getN(nsTest)
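To apply this to your own data instead of the noisySinc() toy set, you would first wrap your feature matrix and known target values in a CVST.data object. The sketch below assumes Training_Points holds the raw feature matrix and known_values the vector of targets; I don't know how your targets are actually stored, so treat the names as placeholders.

# Hypothetical adaptation to your own data (assumed variable names)
d      = constructData(x = as.matrix(Training_Points), y = known_values)
params = constructParams(kernel="rbfdot", sigma=10^(-3:3),
                         lambda=c(0.05, 0.1, 0.2, 0.3)/getN(d))
opt    = fastCV(d, krr, params, constructCVSTModel())
m      = krr$learn(d, opt[[1]])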

If further speedup is required, you can run the cross-validations in parallel. See this post for an example using the doParallel package.
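A minimal sketch of what that setup could look like (the number of workers is an arbitrary choice, and the loop body is just a placeholder for the per-fold work):

# Register a parallel backend, then use foreach/%dopar% for the folds
library(doParallel)
library(foreach)

cl = makeCluster(4)            # number of workers is an arbitrary choice
registerDoParallel(cl)

res = foreach(k = 1:100, .combine = c) %dopar% {
  k^2                          # replace with the per-fold CV computation
}

stopCluster(cl)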

Jellen Vermeir