I have code that takes a dataset of wines and applies a random forest model to predict quality. The model runs, but the confusion matrix call gives the following error:
> cmrfRed <- confusionMatrix(rfpredictionsred, as.factor(testRed$quality))
Error: `data` and `reference` should be factors with the same levels.
If I use as.factor(), I get different levels:
> rfpredictionsred <- as.factor(rfpredictionsred)
> testRed$quality <- as.factor(testRed$quality)
> head(rfpredictionsred)
1 2 3 4 5 6
5.81016666666667 5.13646666666667 5.71616666666666 5.89953333333334 4.98553333333333 7.00693333333334
2262 Levels: 4.31206666666667 4.31726666666667 4.33696666666667 4.34246666666666 4.3447 4.35073333333333 ... 7.9712
> head(testRed$quality)
[1] 6 6 6 5 6 8
Levels: 3 4 5 6 7 8 9
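The same mismatch can be reproduced without the wine data. This is a minimal sketch in base R; the vectors `preds` and `actual` are made-up stand-ins for the predictions and the test-set quality column, not values from the real dataset. It shows that calling as.factor() on continuous values creates one level per distinct number, so the two factors cannot share levels.

```r
# Hypothetical stand-in values, not taken from the wine data
preds  <- c(5.81, 5.14, 5.72, 5.90, 4.99, 7.01)  # continuous predictions
actual <- c(6, 5, 6, 6, 5, 7)                    # integer quality scores

pred_factor   <- as.factor(preds)
actual_factor <- as.factor(actual)

# One level per distinct predicted value
nlevels(pred_factor)

# One level per distinct quality score
nlevels(actual_factor)

# The two sets of levels do not match
identical(levels(pred_factor), levels(actual_factor))
```

With the real data the effect is the same, only larger: thousands of distinct predicted values against a handful of quality scores.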
Here is my full code:
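# Load the packages used below
library(readr)         # read_csv
library(caret)         # createDataPartition, RMSE, confusionMatrix
library(randomForest)  # randomForest
# Read in the red and white wine datasets
wines <- read_csv("wine-quality.csv")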
# Convert 'type' variable to a factor
wines$type <- as.factor(wines$type)
# Add a column to indicate wine type
red_wines$type <- "red"
white_wines$type <- "white"
# Rename columns
names(wines) <- c("type","fixed_acidity","volatile_acidity","citric_acid", "residual_sugar",
"chlorides","free_sulfur_dioxide","total_sulfur_dioxide","density","pH",
"sulphates", "alcohol","quality")
# Set seed for reproducibility
set.seed(123)
# Split the data into training and testing sets
red <- wines[wines$type == "red",]
white <- wines[wines$type == "white",]
trainIndex <- createDataPartition(red$quality, p = 0.8, list = FALSE)
trainWhite <- white[trainIndex, ]
testWhite <- white[-trainIndex, ]
trainIndex <- createDataPartition(white$quality, p = 0.8, list = FALSE)
trainRed <- wines[trainIndex, ]
testRed <- wines[-trainIndex, ]
# Build a random forest model
rfModelWhite <- randomForest(quality ~ ., data = trainWhite, ntree = 500)
rfModelWhite
rfModelRed <- randomForest(quality ~ ., data = trainRed, ntree = 500)
rfModelRed
# Make predictions on the test set
rfpredictionswhite <- predict(rfModelWhite, testWhite)
rfpredictionsred <- predict(rfModelRed, newdata = testRed)
head(rfpredictionsred)
head(rfpredictionswhite)
# Evaluate model performance
rfRedrmse <- RMSE(rfpredictionsred, testRed$quality)
rfWhitermse <- RMSE(rfpredictionswhite, testWhite$quality)
# Confusion Matrices
cmrfRed <- confusionMatrix(rfpredictionsred, testRed$quality)
accuraccyrfRed <- cmrfRed$overall['Accuracy']
cmrfWhite <- confusionMatrix(rfpredictionswhite, testWhite$quality)
accuraccyrfWhite <- cmrfWhite$overall['Accuracy']
I looked at other posts and read the documentation, and I think I understand the factor() function, but I'm still not sure how to fix this case. https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/factor
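For reference, the part of factor() that seems relevant is its `levels` argument, which fixes the set of levels in advance. The sketch below uses made-up vectors (`preds`, `actual`) and a hypothetical `quality_levels` range of 3 to 9, matching the levels printed above. Rounding the predictions to the nearest integer is only an assumption about how continuous values might be mapped to classes, not a confirmed fix for the model itself.

```r
# Hypothetical stand-in values, not taken from the wine data
preds  <- c(5.81, 5.14, 5.72, 5.90, 4.99, 7.01)
actual <- c(6, 5, 6, 6, 5, 7)

# Assumed shared set of quality scores
quality_levels <- 3:9

# Assumption: round continuous predictions to the nearest score
pred_factor   <- factor(round(preds), levels = quality_levels)
actual_factor <- factor(actual, levels = quality_levels)

# Both factors now carry the same levels
identical(levels(pred_factor), levels(actual_factor))

# Base R cross-tabulation of predicted vs. actual scores
table(Predicted = pred_factor, Actual = actual_factor)
```

Two factors built this way have identical levels, which is the condition the error message asks for.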