
I have code that takes a dataset of wines and fits a random forest model to predict quality. The model fits, but the call to confusionMatrix() gives the following error:

> cmrfRed <- confusionMatrix(rfpredictionsred, as.factor(testRed$quality))

Error: `data` and `reference` should be factors with the same levels.

If I call as.factor() on both vectors, they end up with different levels:

> rfpredictionsred <- as.factor(rfpredictionsred)
> testRed$quality <- as.factor(testRed$quality)
> head(rfpredictionsred)
               1                2                3                4                5                6 
5.81016666666667 5.13646666666667 5.71616666666666 5.89953333333334 4.98553333333333 7.00693333333334 
2262 Levels: 4.31206666666667 4.31726666666667 4.33696666666667 4.34246666666666 4.3447 4.35073333333333 ... 7.9712
> head(testRed$quality)
[1] 6 6 6 5 6 8
Levels: 3 4 5 6 7 8 9
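
The level mismatch shown in this output can be reproduced on a handful of made-up numbers. The sketch below uses only base R; `preds` and `truth` are hypothetical stand-ins for `rfpredictionsred` and `testRed$quality`, not the actual data.

```r
# Hypothetical stand-ins: continuous regression output vs. integer scores
preds <- c(5.81, 5.14, 5.72, 5.90, 4.99, 7.01)
truth <- c(6, 6, 6, 5, 6, 8)

# as.factor() creates one level per distinct value, so each decimal
# prediction becomes its own level and none matches an integer level
pred_levels  <- levels(as.factor(preds))
truth_levels <- levels(as.factor(truth))

length(pred_levels)                   # one level per distinct prediction
intersect(pred_levels, truth_levels)  # no levels in common
```

Because the two factors share no levels, a comparison that requires identical level sets cannot proceed.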

Here is my full code:

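# Load the packages the code below relies on
library(readr)         # read_csv()
library(caret)         # createDataPartition(), RMSE(), confusionMatrix()
library(randomForest)  # randomForest()

# Read in the red and white wine datasets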
wines <- read_csv("wine-quality.csv")

# Convert 'type' variable to a factor
wines$type <- as.factor(wines$type)

# Add a column to indicate wine type
red_wines$type <- "red"
white_wines$type <- "white"

# Rename columns
names(wines) <- c("type","fixed_acidity","volatile_acidity","citric_acid", "residual_sugar",
                  "chlorides","free_sulfur_dioxide","total_sulfur_dioxide","density","pH",
                  "sulphates", "alcohol","quality")

# Set seed for reproducibility
set.seed(123)

# Split the data into training and testing sets
red <- wines[wines$type == "red",]
white <- wines[wines$type == "white",]

trainIndex <- createDataPartition(red$quality, p = 0.8, list = FALSE)
trainWhite <- white[trainIndex, ]
testWhite <- white[-trainIndex, ]

trainIndex <- createDataPartition(white$quality, p = 0.8, list = FALSE)
trainRed <- wines[trainIndex, ]
testRed <- wines[-trainIndex, ]

# Build a random forest model
rfModelWhite <- randomForest(quality ~ ., data = trainWhite, ntree = 500)
rfModelWhite
rfModelRed <- randomForest(quality ~ ., data = trainRed, ntree = 500)
rfModelRed

# Make predictions on the test set
rfpredictionswhite <- predict(rfModelWhite, testWhite)
rfpredictionsred <- predict(rfModelRed, newdata = testRed)
head(rfpredictionsred)
head(rfpredictionswhite)

# Evaluate model performance
rfRedrmse <- RMSE(rfpredictionsred, testRed$quality)
rfWhitermse <- RMSE(rfpredictionswhite, testWhite$quality)

# Confusion Matrices
cmrfRed <- confusionMatrix(rfpredictionsred, testRed$quality)
accuraccyrfRed <- cmrfRed$overall['Accuracy']
cmrfWhite <- confusionMatrix(rfpredictionswhite, testWhite$quality)
accuraccyrfWhite <- cmrfWhite$overall['Accuracy']

I have looked at other posts and read the documentation for factor(), and I think I understand how it works, but I still don't see how to fix this case. https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/factor
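
Separately from the error, the split in the code above appears to draw `trainIndex` from one subset and apply it to another (for example, an index built from `red$quality` is used to subset `white`). A minimal sketch of a split where each index comes from the data frame it subsets is given below. It assumes made-up data and swaps in base R `sample()` so that it runs without extra packages; unlike `createDataPartition()`, `sample()` does not stratify by quality. The names `redIndex` and `whiteIndex` are hypothetical.

```r
set.seed(123)

# Made-up data with the two columns the split relies on
wines <- data.frame(
  type    = rep(c("red", "white"), times = c(20, 30)),
  quality = sample(3:9, 50, replace = TRUE)
)

red   <- wines[wines$type == "red", ]
white <- wines[wines$type == "white", ]

# Draw each index from the same data frame it will subset.
# sample() gives a plain random 80/20 split (not stratified).
redIndex <- sample(nrow(red), size = floor(0.8 * nrow(red)))
trainRed <- red[redIndex, ]
testRed  <- red[-redIndex, ]

whiteIndex <- sample(nrow(white), size = floor(0.8 * nrow(white)))
trainWhite <- white[whiteIndex, ]
testWhite  <- white[-whiteIndex, ]
```

With the original packages loaded, the same pattern should apply by calling `createDataPartition(red$quality, p = 0.8, list = FALSE)` for the red split and `createDataPartition(white$quality, p = 0.8, list = FALSE)` for the white one.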

  • Difficult to say without [seeing the data in wines](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). However, you can see that the factor levels in `rfpredictionsred` and `testRed$quality` are different: the first are numeric values with decimal places, converted to factors, and the second are integers, converted to factors. Hence the error message about factors with different levels. – neilfws May 08 '23 at 23:57
  • @neilfws Thanks. I've linked the data I'm using. I see why the error occurs, but I'm not sure how to actually fix it. Should I convert the decimals to integers and then factor both? https://archive.ics.uci.edu/ml/datasets/wine+quality – CocaCola May 09 '23 at 00:10
  • Yes, I think that would be one solution, or adjust the model so that it outputs integers (or factors). Perhaps you want multi-class classification (categories) instead of regression? – neilfws May 09 '23 at 00:20
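
The rounding approach suggested in the comments could look roughly like the sketch below. It is base R only and runs on made-up numbers: `preds` and `truth` are hypothetical stand-ins for `rfpredictionsred` and `testRed$quality`, and `quality_levels` is assumed to be 3 to 9 based on the levels printed in the question.

```r
# Hypothetical stand-ins for the regression predictions and true scores
preds <- c(5.81, 5.14, 5.72, 5.90, 4.99, 7.01)
truth <- c(6, 6, 6, 5, 6, 8)

# Quality scores assumed from the levels shown in the question
quality_levels <- 3:9

# Round to the nearest integer score, clamp to the valid range,
# then give both vectors the same explicit set of levels
rounded     <- pmin(pmax(round(preds), min(quality_levels)), max(quality_levels))
pred_factor <- factor(rounded, levels = quality_levels)
true_factor <- factor(truth,   levels = quality_levels)

# Base R confusion table and overall accuracy
cm       <- table(Predicted = pred_factor, Actual = true_factor)
accuracy <- sum(diag(cm)) / sum(cm)
```

Since both factors now share identical levels, they should also be acceptable to caret's `confusionMatrix(pred_factor, true_factor)`. The other option raised in the comments is to convert `quality` to a factor before fitting, in which case `randomForest()` fits a classification forest and `predict()` returns factor labels directly, so no rounding is needed.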
