I have a small dataset (280 rows) with missing values. I used multiple imputation (mice package, m = 5) to impute it.
Then I applied different regression algorithms (e.g. SVM, rpart) to each imputed dataset using 10-fold cross-validation. I will use the resulting RMSE (root mean square error) to compare the regression algorithms.
The problem is that I end up with five mean RMSE values for each algorithm, since the dataset has been imputed five times. My question is: how can I combine the five RMSE values that belong to one algorithm, so that I can compare the algorithms? In other words, I want to compute an averaged estimate. I know the pool() function can do this for coefficients, but I am not sure whether I can use it with machine learning methods such as SVM and random forest.
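For context, here is a minimal sketch of what pool() is normally used for. It assumes the nhanes example data that ships with mice (not my data) and a plain lm() fit, both chosen only for illustration. As far as I understand, pool() combines coefficient estimates and their variances, which is why it is unclear to me how it would apply to an SVM or random forest fit.

```r
library(mice)

# nhanes is a small example dataset shipped with mice (not the data from this question)
imp <- mice(nhanes, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# Fit the same linear model on each of the 5 imputed datasets
fits <- with(imp, lm(bmi ~ age + hyp))

# pool() combines the 5 sets of coefficients and their variances (Rubin's rules)
pooled <- pool(fits)
summary(pooled)
```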
One solution I thought of is to stack all imputed data frames in long format and then apply my algorithm once, so that I end up with a single mean RMSE. However, I am concerned about overfitting, since the long format may contain repeated records. Please correct me if I am wrong.
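To illustrate the concern, here is a rough sketch of the stacking idea, again assuming the nhanes example data from mice rather than my own data. In long format every original record appears once per imputation, so a random row-wise split could put copies of the same record in both the training and the test fold. The per-record fold assignment at the end is only one possible way around that (the names ids, fold_by_id and the fold column are made up for this sketch), not something I have validated.

```r
library(mice)

imp <- mice(nhanes, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# Stack the 5 imputed datasets; mice adds .imp (imputation number)
# and .id (identifier of the original row)
long <- complete(imp, action = "long")

# How many times does each original record appear in the stacked data?
table(table(long$.id))

# Assumption for this sketch: assign folds per original record, not per
# stacked row, so all copies of a record fall into the same fold
fold <- 10
ids <- unique(long$.id)
set.seed(7)
fold_by_id <- sample(rep_len(seq_len(fold), length(ids)))
long$fold <- fold_by_id[match(long$.id, ids)]
```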
Thank you very much and hopefully, you can help me.
The following is my code.
library(mice)
library(randomForest)

x <- data
form <- target ~ .  # model formula; "target" is the response column
fold <- 10          # number of folds for cross-validation
imp <- mice(x, method = "pmm", m = 5)  # imputation with mice pmm (m = 5 imputed datasets)
impSetsVector <- list()  # will hold the 5 imputed datasets
for (i in seq(5)) {
  impSetsVector[[i]] <- complete(imp, action = i, include = FALSE)
}
## Next, apply randomForest with 10-fold cross-validation to each imputed dataset
## and compute the RMSE for each dataset
avg.rmse <- matrix(data = NA, nrow = 5, ncol = 1)  # mean RMSE for each imputed dataset
for (j in seq(5)) {  # one pass per imputed dataset
  x <- impSetsVector[[j]]  # the j-th imputed dataset
  n <- nrow(x)
  prop <- ceiling(n / fold)  # fold size, rounded up so fold labels never exceed 'fold'
  set.seed(7)  # same fold assignment for every imputed dataset
  newseq <- rank(runif(n))
  k <- as.factor((newseq - 1) %/% prop + 1)  # fold label for each row
  y <- as.character(form)[2]  # name of the response column
  vec.error <- vector(length = fold)
  ## start modeling with 10-fold cross-validation
  for (i in seq(fold)) {
    # Fit randomForest on the training folds
    fit <- randomForest(form, data = x[k != i, ], ntree = 500, keep.forest = TRUE,
                        importance = TRUE, na.action = na.omit)
    fcast <- predict(fit, newdata = x[k == i, ])  # predict on the held-out fold
    rmse <- sqrt(mean((x[k == i, y] - fcast)^2))
    vec.error[i] <- rmse  # RMSE for the held-out fold
  }  # end of the inner loop
  avg.rmse[j] <- mean(vec.error)  # mean of the 10 fold-wise RMSE values
}  # end of the outer loop
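
The combination step I am asking about could be as simple as the sketch below. It assumes that a plain average of the per-imputation mean RMSE values is acceptable, which is exactly the point I am unsure about. pool_rmse is a hypothetical helper name, and the numbers are made up purely to show the call; with the code above the input would be avg.rmse.

```r
# Hypothetical helper: summarise the per-imputation mean RMSE values
# of one algorithm as a single mean plus the spread across imputations
pool_rmse <- function(rmse_by_imputation) {
  v <- as.numeric(rmse_by_imputation)
  c(mean = mean(v), sd = sd(v), m = length(v))
}

# Made-up values for illustration only; replace with avg.rmse
example_rmse <- c(2.10, 2.25, 2.05, 2.30, 2.15)
pool_rmse(example_rmse)
```

The standard deviation across imputations is reported alongside the mean so that a difference between two algorithms can be judged against the variation caused by the imputation itself.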