
If I am using two methods (NN and KNN) with caret and want to run a significance test afterwards, how can I perform the Wilcoxon test?

I provided a sample of my data as follows:

structure(list(Input = c(25, 193, 70, 40), Output = c(150, 98, 27, 60),
               Inquiry = c(75, 70, 0, 20), File = c(60, 36, 12, 12),
               FPAdj = c(1, 1, 0.8, 1.15), RawFPcounts = c(1750, 1902, 535, 660),
               AdjFP = c(1750, 1902, 428, 759), Effort = c(102.4, 105.2, 11.1, 21.1)),
          row.names = c(NA, 4L), class = "data.frame")

    library(caret)
    library(farff)  # readARFF() is provided by the farff package

    d <- readARFF("albrecht.arff")
    index <- createDataPartition(d$Effort, p = .70, list = FALSE)
    tr <- d[index, ]
    ts <- d[-index, ]

    boot <- trainControl(method = "repeatedcv", number = 100)

    cart1 <- train(log10(Effort) ~ ., data = tr,
                   method = "knn",
                   metric = "MAE",
                   preProc = c("center", "scale", "nzv"),
                   trControl = boot)

    postResample(predict(cart1, ts), log10(ts$Effort))

    cart2 <- train(log10(Effort) ~ ., data = tr,
                   method = "knn",
                   metric = "MAE",
                   preProc = c("center", "scale", "nzv"),
                   trControl = boot)

    postResample(predict(cart2, ts), log10(ts$Effort))

How do I perform wilcox.test() here?

    Warm regards
Neha gupta
  • @jay.sf, I edited my question. – Neha gupta Jun 03 '20 at 22:42
  • the output of `postResample(predict(cart2, ts), log10(ts$Effort))` is a vector containing RMSE, Rsq and MAE, one value for each metric. So for instance MAE for knn is 1.5 and MAE for NN is 0.7. There is no statistical test that can compare one value to another. What you need is to generate several MAE values for knn and several MAE values for NN which you can compare using a statistical test. The most straightforward way to do so is by using Nested resampling. – missuse Jun 04 '20 at 08:49
  • @missuse, thanks a lot. Actually I have been thinking about how to use nested CV in my case for two days. Someone provided hints for it, but I am still slightly confused about how to use the outer CV for the statistical tests. – Neha gupta Jun 04 '20 at 10:34
  • Ohhh my God, hello Milan.. I am very glad that you are here. I am sorry that I did not recognize you.. you are always more than helpful Milan – Neha gupta Jun 04 '20 at 10:47
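To make the point from the comments above concrete: postResample() returns a single named vector with one value per metric, so there is nothing a statistical test could compare. A minimal sketch with invented observed/predicted values:

    library(caret)

    # toy values, invented purely for illustration
    obs  <- c(2.0, 1.5, 3.2, 2.8)
    pred <- c(2.1, 1.4, 3.0, 3.0)

    # returns one named vector: RMSE, Rsquared and MAE (one value each)
    postResample(pred, obs)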

2 Answers


One way to deal with your problem is to generate several performance values for knn and NN, which you can then compare using a statistical test. This can be achieved with nested resampling.

In nested resampling you perform train/test splits multiple times and evaluate the model on each test set.

Let's use the BostonHousing data, for instance:

library(caret)
library(mlbench)

data(BostonHousing)

Let's select just the numeric columns to keep the example simple:

d <- BostonHousing[,sapply(BostonHousing, is.numeric)]

As far as I know there is no way to perform nested CV in caret out of the box, so a simple wrapper is needed.

Generate the outer folds for the nested CV:

outer_folds <- createFolds(d$medv, k = 5)

Let's use bootstrap resampling as the inner resampling loop to tune the hyperparameters:

boot <- trainControl(method = "boot",
                     number = 100)

Now loop over the outer folds, performing hyperparameter optimization on each train set and predicting on the corresponding test set:

CV_knn <- lapply(outer_folds, function(index){
  tr <- d[-index, ]
  ts <- d[index,]
  
  cart1 <- train(medv ~ ., data = tr,
                 method = "knn",
                 metric = "MAE",
                 preProc = c("center", "scale", "nzv"),
                 trControl = boot,
                 tuneLength = 10) # to keep it short we only probe 10 hyperparameter values
  
  postResample(predict(cart1, ts), ts$medv)
})

Extract just the MAE from the results:

sapply(CV_knn, function(x) x[3]) -> CV_knn_MAE
CV_knn_MAE
#output
Fold1.MAE Fold2.MAE Fold3.MAE Fold4.MAE Fold5.MAE 
 2.503333  2.587059  2.031200  2.475644  2.607885 

Do the same for a glmnet learner, for instance:

CV_glmnet <- lapply(outer_folds, function(index){
  tr <- d[-index, ]
  ts <- d[index,]
  
  cart1 <- train(medv ~ ., data = tr,
                 method = "glmnet",
                 metric = "MAE",
                 preProc = c("center", "scale", "nzv"),
                 trControl = boot,
                 tuneLength = 10)
  
  postResample(predict(cart1, ts), ts$medv)
})

sapply(CV_glmnet, function(x) x[3]) -> CV_glmnet_MAE

CV_glmnet_MAE
#output
Fold1.MAE Fold2.MAE Fold3.MAE Fold4.MAE Fold5.MAE 
 3.400559  3.383317  2.830140  3.605266  3.525224

Now compare the two using wilcox.test(). Since the performance values for both learners were generated on the same data splits, a paired test is appropriate:

wilcox.test(CV_knn_MAE,
            CV_glmnet_MAE,
            paired = TRUE)
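As a small usage note, wilcox.test() returns a standard htest object, so you can store it and pull out the p-value for reporting:

res <- wilcox.test(CV_knn_MAE,
                   CV_glmnet_MAE,
                   paired = TRUE)
res$p.value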

If you are comparing more than two algorithms, you can use friedman.test(), as sketched below.
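A minimal sketch, assuming a third per-fold MAE vector exists (CV_rf_MAE is hypothetical here; it would be produced exactly like the two above, just with a different method in train()). friedman.test() accepts a matrix where rows are the blocks (folds) and columns are the groups (learners):

# CV_rf_MAE is hypothetical: generate it like CV_knn_MAE / CV_glmnet_MAE,
# e.g. with method = "rf", using the same outer_folds
mae_matrix <- cbind(knn = CV_knn_MAE,
                    glmnet = CV_glmnet_MAE,
                    rf = CV_rf_MAE)

# rows = folds (blocks), columns = learners (groups)
friedman.test(mae_matrix)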

missuse
  • Thanks a lot Milan once again for your detailed response; in other words, the exact solution I am looking for. It will also help others, I am sure. You are cooperative in public as well as private forums. – Neha gupta Jun 04 '20 at 11:25

Does this work for you?

library(caret)
df <- structure(list(Input = c(25, 193, 70, 40), Output = c(150, 98, 27, 60),
                     Inquiry = c(75, 70, 0, 20), File = c(60, 36, 12, 12),
                     FPAdj = c(1, 1, 0.8, 1.15), RawFPcounts = c(1750, 1902, 535, 660),
                     AdjFP = c(1750, 1902, 428, 759), Effort = c(102.4, 105.2, 11.1, 21.1)),
                row.names = c(NA, 4L), class = "data.frame")

# not enough data points in df for ML: increase the number of df rows X10
d <- df[rep(seq_len(nrow(df)), 10), ]

index <- createDataPartition(d$Effort, p = .70,list = FALSE)
tr <- d[index, ]
ts <- d[-index, ] 

boot <- trainControl(method = "repeatedcv", number=100)

cart1 <- train(log10(Effort) ~ ., data = tr,
               method = "knn",
               metric = "MAE",
               preProc = c("center", "scale", "nzv"),
               trControl = boot)

# save the output to "model_predictions_1"
model_predictions_1 <- postResample(predict(cart1, ts), log10(ts$Effort))

cart2 <- train(log10(Effort) ~ ., data = tr,
               method = "knn",
               metric = "MAE",
               preProc = c("center", "scale", "nzv"),
               trControl = boot)

# save the output to "model_predictions_2"
model_predictions_2 <- postResample(predict(cart2, ts), log10(ts$Effort))

# test model_predictions_1 vs model_predictions_2
wilcox.test(model_predictions_1, model_predictions_2, exact = FALSE)
jared_mamrot
  • Thanks for your response, but don't you think model_predictions_1 and model_predictions_2 contain just one value per metric on the test data (i.e. a single MAE value), so would it be appropriate to perform statistical tests on one value? Regards – Neha gupta Jun 04 '20 at 10:37
  • I do not think it would be appropriate / useful to compare models in this way, but that was what you asked for (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) – jared_mamrot Jun 05 '20 at 10:24