5

I have multiple classification machine learning models, all with different accuracies. When I run my xgBOOST model (using library(caret)) in the console, I get an accuracy of 0.7586, but when I knit my R Markdown document, the accuracy of the same model is 0.8621. I have no idea why these differ.

I followed the suggestions of this link, but nothing worked: https://community.rstudio.com/t/console-and-rmd-output-differ-same-program-used-but-the-calculation-gives-a-different-result/67873/3

I also followed the suggestions from this question, but nothing worked: Statistics Result in R Markdown is different from the Knit Output (All Format: Word, HTML, PDF)

Finally I tried this, but it also did not work: sample function gives different result in console and in knitted document when seed is set

Here is the code that I run identically in the console and in R Markdown, yet the accuracies differ:

    # Data
    library(caret)   # provides createDataPartition(), preProcess(), train(), confusionMatrix()
    data <- data[!is.na(data$var1), ]

    # Change levels of var1
    levels(data$var1) <- c("No", "Yes")

    # Data preparation and preprocessing
    # Create the training and test datasets
    set.seed(100)

    # Step 1: Get row numbers for the training data
    trainRowNumbers <- createDataPartition(data$var1, p = 0.8, list = FALSE)

    # Step 2: Create the training dataset
    trainset <- data[trainRowNumbers, ]

    # Step 3: Create the test dataset
    testset <- data[-trainRowNumbers, ]

    # Store Y for later use
    y <- trainset$var1

    # Create the knn imputation model on the training data
    preProcess_missingdata_model <- preProcess(as.data.frame(trainset), method = c("knnImpute"))
    preProcess_missingdata_model

    # Create the knn imputation model on the testset data
    preProcess_missingdata_model_test <- preProcess(as.data.frame(testset), method = c("knnImpute"))
    preProcess_missingdata_model_test

    # Use the imputation models to predict the values of missing data points
    library(RANN)   # required for knnImpute
    trainset <- predict(preProcess_missingdata_model, newdata = trainset)
    anyNA(trainset)

    testset <- predict(preProcess_missingdata_model_test, newdata = testset)
    anyNA(testset)

    # Append the Y variable
    trainset$var1 <- y

    # Run algorithms using 5-fold cross validation
    control <- trainControl(method = "cv",
                            number = 5,
                            repeats = 5,
                            savePredictions = "final",
                            search = "grid",
                            classProbs = TRUE)
    metric <- "Accuracy"

    # Make valid column names
    colnames(trainset) <- make.names(colnames(trainset))
    colnames(testset) <- make.names(colnames(testset))

    # xgBOOST
    set.seed(7)
    fit.xgbDART <- train(var1 ~ ., data = trainset, method = "xgbTree", metric = metric,
                         trControl = control, verbose = FALSE, tuneLength = 7, nthread = 1)

    # Estimate skill of xgBOOST on the testset dataset
    predictions <- predict(fit.xgbDART, testset)
    cm <- caret::confusionMatrix(predictions, testset$var1, mode = "everything")
    cm

My RNGkind() output is:

    RNGkind()
    [1] "L'Ecuyer-CMRG" "Inversion"     "Rejection"
  • When doing tasks involving sampling, e.g. v-fold cross-validation, up/down-sampling, tuning of hyperparameters: are you using set.seed()? If not, each time you knit you might be using a different seed to generate the random numbers. – Desmond May 28 '21 at 12:42
  • @Desmond Yes, I am using set.seed(), but do I need to put set.seed(7) before doing every task like cross-validation and tuning, or is once enough? –  May 28 '21 at 12:59
  • Before every task that involves random numbers. That's what I do, at least. – Desmond May 28 '21 at 13:44
  • @Desmond Even when setting a seed before every task, I keep getting different results. If I have to choose, which result is more reliable: the console or R Markdown? –  May 28 '21 at 15:14
  • Could you share a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your data and results? – Desmond May 29 '21 at 01:59
  • @Desmond see update above! –  May 29 '21 at 09:10

2 Answers

1

Always add the function:

    set.seed(544)

This function sets the starting point used to generate a sequence of random numbers, which ensures that you get the same result each time you run the same process from the same seed. For example, if I use the sample() function immediately after setting a seed, I will always get the same sample.
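
A minimal illustration (the vector and seed here are arbitrary):

    # Re-seeding immediately before sampling makes the draw reproducible
    set.seed(544)
    sample(1:10, 3)   # some three numbers

    set.seed(544)
    sample(1:10, 3)   # exactly the same three numbers as above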

0

This is my suggestion on where to use set.seed()

    # Data
    data <- data[!is.na(data$var1), ]

    # Change levels of var1
    levels(data$var1) <- c("No", "Yes")

    # Data preparation and preprocessing
    # Create the training and test datasets

    # Step 1: Get row numbers for the training data
    set.seed(100)
    trainRowNumbers <- createDataPartition(data$var1, p = 0.8, list = FALSE)

    # Step 2: Create the training dataset
    trainset <- data[trainRowNumbers, ]

    # Step 3: Create the test dataset
    testset <- data[-trainRowNumbers, ]

    # Store Y for later use
    y <- trainset$var1

    # Create the knn imputation model on the training data
    set.seed(100)
    preProcess_missingdata_model <- preProcess(as.data.frame(trainset), method = c("knnImpute"))
    preProcess_missingdata_model

    # Create the knn imputation model on the testset data
    set.seed(100)
    preProcess_missingdata_model_test <- preProcess(as.data.frame(testset), method = c("knnImpute"))
    preProcess_missingdata_model_test

    # Use the imputation models to predict the values of missing data points
    library(RANN)   # required for knnImpute
    trainset <- predict(preProcess_missingdata_model, newdata = trainset)
    anyNA(trainset)

    testset <- predict(preProcess_missingdata_model_test, newdata = testset)
    anyNA(testset)

    # Append the Y variable
    trainset$var1 <- y

    # Run algorithms using 5-fold cross validation
    set.seed(100)
    control <- trainControl(method = "cv",
                            number = 5,
                            repeats = 5,
                            savePredictions = "final",
                            search = "grid",
                            classProbs = TRUE)
    metric <- "Accuracy"

    # Make valid column names
    colnames(trainset) <- make.names(colnames(trainset))
    colnames(testset) <- make.names(colnames(testset))

    # xgBOOST
    set.seed(7)
    fit.xgbDART <-
      train(
        var1 ~ .,
        data = trainset,
        method = "xgbTree",
        metric = metric,
        trControl = control,
        verbose = FALSE,
        tuneLength = 7,
        nthread = 1
      )

    # Estimate skill of xgBOOST on the testset dataset
    predictions <- predict(fit.xgbDART, testset)
    cm <- caret::confusionMatrix(predictions, testset$var1, mode = "everything")
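
A side note on the trainControl() call above: the repeats argument only takes effect when method = "repeatedcv"; with method = "cv" caret ignores it (recent versions warn about this). If repeated 5-fold cross-validation is actually what is intended, a minimal sketch would be:

    # Sketch: repeated 5-fold cross-validation ("repeatedcv" is what makes `repeats` take effect)
    set.seed(100)
    control <- trainControl(method = "repeatedcv",
                            number = 5,
                            repeats = 5,
                            savePredictions = "final",
                            search = "grid",
                            classProbs = TRUE)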
Desmond
  • Thanks for your suggestions. I tried running my script on a different laptop and the results are exactly the same in both the console and R Markdown, so I suspect there is something wrong with my environment or something else. –  May 31 '21 at 13:19