
I have a small dataset (280 rows) with missing values. I used multiple imputation (the mice package, m = 5) to impute the dataset.

Then I applied different regression algorithms (e.g. SVM, rpart, etc.) with 10-fold cross-validation to each imputed dataset. I will use the resulting RMSE (root mean square error) values to compare the regression algorithms.

The thing is, I end up with 5 mean RMSE values for each algorithm, since the dataset was imputed 5 times. My question is: how can I combine the five RMSE values that belong to one algorithm, so that I can compare the algorithms? In other words, I want a single pooled figure. I know the pool() function does this for regression coefficients, but I am not sure whether it can be used with machine learning models such as SVM and random forest.
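For what it's worth, pool() implements Rubin's rules for pooling parameter estimates and their standard errors, so it does not apply directly to a performance metric like RMSE. A common pragmatic approach (a sketch with illustrative numbers, not actual results) is simply to average the five per-imputation RMSE values and report their spread:

```r
# Illustrative values only -- substitute the five CV RMSEs from your own runs
rmse_per_imputation <- c(3.10, 2.90, 3.00, 3.20, 3.05)

combined_rmse <- mean(rmse_per_imputation)  # single figure per algorithm
rmse_spread  <- sd(rmse_per_imputation)     # between-imputation variability
```

Comparing algorithms on combined_rmse while also reporting rmse_spread shows whether imputation noise is small relative to the differences between algorithms.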

One solution I thought about is combining all the data frames in long (stacked) format, then applying my algorithm once, which would give a single mean RMSE. However, I was concerned about overfitting, since the long format contains repeated copies of the same records. Please correct me if I am wrong.
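On the stacked-format idea: mice::complete(imp, action = "long") returns exactly such a frame, with .imp and .id columns identifying the imputation number and the original row. The overfitting concern is justified, because the five imputed copies of a record can end up split between training and test folds. One way around it (a sketch on a tiny hand-built toy frame, not real mice output) is to assign CV folds by .id, so all copies of a record stay in the same fold:

```r
# Toy stacked ("long") frame: 4 original records, m = 2 imputations.
# mice::complete(imp, action = "long") produces a frame of this shape.
long_data <- data.frame(.imp = rep(1:2, each = 4), .id = rep(1:4, times = 2))

set.seed(7)
n_ids <- length(unique(long_data$.id))
fold_of_id <- sample(rep(1:2, length.out = n_ids))  # one fold per ORIGINAL record
long_data$fold <- fold_of_id[long_data$.id]
# All copies of the same record now share a fold, so a record's imputed
# duplicates can never leak from the training folds into the test fold.
```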

Thank you very much and hopefully, you can help me.

The following is my code.

x <- data
form <- target ~ .   # model formula; 'target' is the response column in 'data'
fold <- 10           # number of folds for cross-validation

imp <- mice(x, meth = "pmm", m = 5)  # multiple imputation with predictive mean matching (m = 5)

impSetsVector <- list()  # will hold the 5 completed (imputed) datasets
for (i in seq(5)) {
  impSetsVector[[i]] <- complete(imp, action = i, include = FALSE)
}


## Next, apply random forest with 10-fold cross-validation to each imputed set
## and compute the RMSE for each dataset.

avg.rmse <- matrix(data = NA, nrow = 5, ncol = 1)  # one mean RMSE per imputed dataset

for (j in seq(5)) {              # loop over the 5 imputed datasets
  x <- impSetsVector[[j]]        # the j-th completed dataset
  n <- nrow(x)
  prop <- n %/% fold
  set.seed(7)
  newseq <- rank(runif(n))
  k <- as.factor((newseq - 1) %/% prop + 1)  # fold assignment for each row
  y <- all.vars(form)[1]         # name of the response variable
  vec.error <- vector(length = fold)

  ## 10-fold cross-validation
  for (i in seq(fold)) {
    # Fit a random forest on the training folds
    fit <- randomForest(form, data = x[k != i, ], ntree = 500,
                        keep.forest = TRUE, importance = TRUE, na.action = na.omit)

    fcast <- predict(fit, newdata = x[k == i, ])          # predict on the test fold
    vec.error[i] <- sqrt(mean((x[k == i, y] - fcast)^2))  # RMSE on the test fold
  } # end of inner loop

  avg.rmse[j] <- mean(vec.error)  # mean of the 10 fold RMSEs
} # end of outer loop
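Given the loop above, the five entries of avg.rmse can be collapsed into one comparable figure per algorithm (a sketch; the numbers below are illustrative stand-ins for the loop's real output):

```r
# avg.rmse as produced by the loop above; illustrative numbers used here
avg.rmse <- matrix(c(3.10, 2.90, 3.00, 3.20, 3.05), nrow = 5, ncol = 1)

overall.rmse <- mean(avg.rmse)           # the single figure to compare across algorithms
imp.sd       <- sd(as.vector(avg.rmse))  # how much imputation alone moves the RMSE
```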
  • There are multiple ways to do this, such as (Bayesian) model averaging. – alexwhitworth Sep 29 '16 at 16:46
  • ... Also, I'm not sure what 95% of your post has to do with your question. It seems to me that the heart of the question is how to compare RMSE between models. I think your post would be better if you focused the question on only that. A [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for that shouldn't be too difficult to create. – alexwhitworth Sep 29 '16 at 16:47
  • Thank you very much @AlexW for your comment. Actually, the heart of my question is how to combine the RMSE values that result from the complete(impute, action=1) dataset, the complete(impute, action=2) dataset, ..., the complete(impute, action=5) dataset. Is my question clear, or do I need to edit it? – Zee H Sep 29 '16 at 17:08
  • Exactly... But instead of focusing your question in that 'space,' you spend 95% of your post on imputation and CV, which have nothing to do with your actual question. I suggest you simplify the question to focus on **just** combining / comparing the RMSE results. As a result, you should also update your tags appropriately-- `algorithm`, `r-mice`, `standard-error`, and `imputation` are not needed. – alexwhitworth Sep 29 '16 at 17:40

0 Answers