I have multiple classification machine learning models with all different accuracy. When I run my xgBOOST (using library(caret)
) in the console, I get an accuracy of 0.7586. But when I knit my Rmarkdown, the accuracy of the same model is 0.8621. I have no idea why this is different.
I followed the suggestions of this link, but nothing worked: https://community.rstudio.com/t/console-and-rmd-output-differ-same-program-used-but-the-calculation-gives-a-different-result/67873/3
I also followed the suggestions of problem, but nothing worked: Statistics Result in R Markdown is different from the Knit Output (All Format: Word, HTML, PDF)
At last I tried this, but also nothing worked: sample function gives different result in console and in knitted document when seed is set
Here is my code which I run the same in console and Rmarkdown but with different accuracy:
# Data
data <- data[!is.na(data$var1),]
# Change levels of var1
levels(data$var1)=c("No","Yes")
#Data Preparation and Preprocessing
# Create the training and test datasets
set.seed(100)
# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(data$var1, p=0.8, list=FALSE)
# Step 2: Create the training dataset
trainset <- data[trainRowNumbers,]
# Step 3: Create the test dataset
testset <- data[-trainRowNumbers,]
# Store Y for later use.
y = trainset$var1
# Create the knn imputation model on the training data
preProcess_missingdata_model <- preProcess(as.data.frame(trainset), method= c("knnImpute"))
preProcess_missingdata_model
# Create the knn imputation model on the testset data
preProcess_missingdata_model_test <- preProcess(as.data.frame(testset), method = c("knnImpute"))
preProcess_missingdata_model_test
# Use the imputation model to predict the values of missing data points
library(RANN) # required for knnInpute
trainset <- predict(preProcess_missingdata_model, newdata = trainset)
anyNA(trainset)
# Use the imputation model to predict the values of missing data points
library(RANN) # required for knnInpute
testset <- predict(preProcess_missingdata_model_test, newdata = testset)
anyNA(testset)
# Append the Y variable
trainset$var1 <- y
# Run algorithms using 5-fold cross validation
control <- trainControl(method="cv",
number=5,
repeats = 5,
savePredictions = "final",
search = "grid",
classProbs = TRUE)
metric <- "Accuracy"
# Make Valid Column Names
colnames(trainset) <- make.names(colnames(trainset))
colnames(testset) <- make.names(colnames(testset))
# xgBOOST
set.seed(7)
fit.xgbDART <- train(var1~., data = trainset, method = "xgbTree", metric = metric, trControl = control, verbose = FALSE, tuneLength = 7, nthread = 1)
# estimate skill of xgBOOST on the testset dataset
predictions <- predict(fit.xgbDART, testset)
cm <- caret::confusionMatrix(predictions, testset$var1, mode='everything')
cm
My RNGKind is:
RNGkind()
[1] "L'Ecuyer-CMRG" "Inversion" "Rejection"