
I am trying to compute the accuracy of a decision tree on the seeds dataset (Link to the seeds dataset) over 20 iterations; however, I am getting a very low overall accuracy (30%-35%). This is what I've done so far:

library(rpart)
# read the tab-separated seeds data and name the columns
seed = read.csv("seeds_dataset.txt", header = F, sep = "\t")
colnames(seed) <- c("area", "per.", "comp.", "l.kernel", "w.kernel", "asy_coeff", "lenkernel", "type")

sampleSize <- nrow(seed)
mat = matrix(nrow = sampleSize, ncol = 20)  # one column of per-row accuracies per iteration
for (t in 1:20) {
  # draw a sample of the rows for this iteration
  testSampleIdx <- sample(nrow(seed), size = sampleSize)
  data <- seed[testSampleIdx, ]

  # leave-one-out: train on every row except i, then predict row i
  for (i in 1:nrow(data)) {
    training = data[-i, ]
    test = data[i, ]
    classification = rpart(type ~ ., data = training, method = "class")
    prediction = predict(classification, newdata = test, type = "class")
    cm = table(test$type, prediction)
    accuracy <- sum(diag(cm)) / sum(cm)
    mat[i, t] = accuracy
  }
}
for (i in 1:ncol(mat)) {
  print(paste("accuracy for ", i, " iteration ", round((mean(mat[, i])) * 100, 1), "%", sep = ""))
}
print(paste("overall accuracy ", round((mean(mat)) * 100, 1), "%", sep = ""))

Can anyone provide comments and feedback on what is causing this low accuracy? Thank you.

s_am
  • @Rui Barradas - Reinstate Monic Can you help me with this issue, please? – s_am Jan 03 '20 at 15:48
  • This doesn't appear to be a specific programming question that's appropriate for Stack Overflow. If you want advice on improving the accuracy of a statistical model, you should probably ask your question instead at [stats.se], where statistical questions are on topic. – MrFlick Jan 03 '20 at 16:05
  • @MrFlick Thank you very much for your comment and suggestion. I want to check that my code is correct and bug-free and make sure that I am getting correct results. Thank you again – s_am Jan 03 '20 at 16:16
  • I think you should follow our reproducibility guidelines, you may want to read [how-to-make-a-great-r-reproducible-example](https://stackoverflow.com/a/5963610/6574038). – jay.sf Jan 03 '20 at 16:18
  • A couple of notes: 1. The code as shown produces NaN, because the initial dataset contains rows for which type is NA. Those rows need to be removed first. 2. When you are taking a sample (testSampleIdx <- sample(nrow(seed), size=sampleSize)) you are just taking a permutation of the entire dataset. Therefore the outer for loop is unnecessary. In the final mat all 20 columns will give the same accuracy, since they are permutations of the same data. The accuracy I am getting is 35%. – BigFinger Jan 03 '20 at 16:23
  • To get a distribution of accuracies, instead of a single value, you could do a bootstrap approach. Do the sampling with replacement, i.e. testSampleIdx <- sample(nrow(seed), size=sampleSize, replace=TRUE) – BigFinger Jan 03 '20 at 16:26
  • @jay.sf Thank you very much for letting me know – s_am Jan 03 '20 at 16:28
  • @BigFinger Thank you very much for your valuable feedback; I am getting nearly the same results. Does that mean that my code is error-free and that this is an accurate result? – s_am Jan 03 '20 at 16:31
  • I don't see any other issue with your code. – BigFinger Jan 03 '20 at 16:32
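
The two changes BigFinger describes in the last two comments can be illustrated in isolation. This is only a sketch, assuming the seed data frame built in the question: with size = nrow(seed) and no replacement, sample() merely reorders the rows, while replace = TRUE draws a genuine bootstrap resample.

# a permutation: every row appears exactly once, just in a new order
perm.idx <- sample(nrow(seed), size = nrow(seed))
all(sort(perm.idx) == 1:nrow(seed))     # TRUE

# a bootstrap resample: some rows repeat, others are left out
boot.idx <- sample(nrow(seed), size = nrow(seed), replace = TRUE)
length(unique(boot.idx)) < nrow(seed)   # almost certainly TRUE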

1 Answer


Here is the edited code, with the rows missing a type removed and the sampling done with replacement:

library(rpart)
seed.all = read.csv("~/Downloads/seeds_dataset.txt", header = F, sep = "\t")
colnames(seed.all) <- c("area", "per.", "comp.", "l.kernel", "w.kernel", "asy_coeff", "lenkernel", "type")

# drop the rows with a missing type; they were producing NaN accuracies
seed = seed.all[!is.na(seed.all$type), ]

sampleSize <- nrow(seed)
mat = matrix(nrow = sampleSize, ncol = 20)
for (t in 1:20) {
  # sample with replacement (bootstrap), so each iteration sees a different resample
  testSampleIdx <- sample(nrow(seed), size = sampleSize, replace = TRUE)
  data <- seed[testSampleIdx, ]

  for (i in 1:nrow(data)) {
    training = data[-i, ]
    test = data[i, ]
    classification = rpart(type ~ ., data = training, method = "class")
    prediction = predict(classification, newdata = test, type = "class")
    cm = table(test$type, prediction)
    accuracy <- sum(diag(cm)) / sum(cm)
    mat[i, t] = accuracy
  }
}
for (i in 1:ncol(mat)) {
  print(paste("accuracy for ", i, " iteration ", round((mean(mat[, i])) * 100, 1), "%", sep = ""))
}
## [1] "accuracy for 1 iteration 30.1%"
## [1] "accuracy for 2 iteration 34%"
## [1] "accuracy for 3 iteration 28.6%"
## [1] "accuracy for 4 iteration 34.5%"
## [1] "accuracy for 5 iteration 38.3%"
## [1] "accuracy for 6 iteration 33.5%"
## [1] "accuracy for 7 iteration 33.5%"
## [1] "accuracy for 8 iteration 36.9%"
## [1] "accuracy for 9 iteration 25.7%"
## [1] "accuracy for 10 iteration 31.6%"
## [1] "accuracy for 11 iteration 35.4%"
## [1] "accuracy for 12 iteration 39.8%"
## [1] "accuracy for 13 iteration 38.8%"
## [1] "accuracy for 14 iteration 21.8%"
## [1] "accuracy for 15 iteration 32.5%"
## [1] "accuracy for 16 iteration 34.5%"
## [1] "accuracy for 17 iteration 33%"
## [1] "accuracy for 18 iteration 39.3%"
## [1] "accuracy for 19 iteration 31.1%"
## [1] "accuracy for 20 iteration 33.5%"
print(paste("overall accuracy ", round((mean(mat))*100,1), "%", sep=""))
## [1] "overall accuracy 33.3%"
BigFinger