I am new to R and want to solve a binary classification task.

The dataset has a factor variable LABELS with two classes: 0 and 1. The head of the actual data is shown in an image (unbalanced_dataset). The TimeDate column is just an index. The class distribution is computed as follows:

print("the number of values with % in factor variable - LABELS:")
percentage <- prop.table(table(dataset$LABELS)) * 100
cbind(freq=table(dataset$LABELS), percentage=percentage)

Result of the class distribution (image: classes)

I also know that the Slot2 column is calculated using the formula:

Slot2 = Var3 - Slot3 + Slot4

The features Var1, Var2, Var3 and Var4 were selected after analyzing the correlation matrix.

Before modeling, I split the dataset into train and test parts. I then tried to build a random forest model for the binary classification task using the following code:

library(randomForest)

rf2 <- randomForest(LABELS ~ Var1 + Var2 + Var3 + Var4,
                    data = train, ntree = 100,
                    mtry = 4, importance = TRUE)
print(rf2)

The result is:

  Call:
     randomForest(formula = LABELS ~ Var1 + Var2  + Var3 + Var4,
     data = train, ntree = 100,      mtry = 4, importance = TRUE) 

 Type of random forest: classification
 Number of trees: 100
 No. of variables tried at each split: 4

 OOB estimate of  error rate: 0.16%

 Confusion matrix:
           0      1 class.error
    0 164957    341 0.002062941
    1    280 233739 0.001196484

When I tried to predict and build the confusion matrices:

# Prediction & Confusion Matrix - train data
p1 <- predict(rf2, train, type="prob")
print("Prediction & Confusion Matrix - train data")
confusionMatrix(p1, train$LABELS)

# Prediction & Confusion Matrix - test data
p2 <- predict(rf2, test, type="prob")
print("Prediction & Confusion Matrix - test data")
confusionMatrix(p2, test$LABELS)

I received an error in R:

[1] "Prediction & Confusion Matrix - train data"
Error: `data` and `reference` should be factors with the same levels.
Traceback:

1. confusionMatrix(p1, train$LABELS)
2. confusionMatrix.default(p1, train$LABELS)
3. stop("`data` and `reference` should be factors with the same levels.", 
 .     call. = FALSE)

I have already tried to fix it using ideas from the following questions:

  1. Error in ConfusionMatrix the data and reference factors must have the same number of levels R CARET

  2. Error in Confusion Matrix : the data and reference factors must have the same number of levels

but they did not help in my case.

Could you please help me with this error?

I would appreciate any ideas and comments. Thank you in advance.

jmuhlenkamp
Cindy

  • What does `p1` look like? Without seeing your data, I'm guessing one issue is that you're predicting the probabilities of each class, not the class itself. Try changing to `type = "response"`, which will give the single most likely class for each observation. I'm not very familiar with the confusion matrix function, but I'm guessing it expects classes, not probabilities – camille Jun 06 '18 at 15:20
  • @camille, thank you for your suggestion. It fixed the error, but now another problem appears: the prediction result contains only one class instead of the 2 existing ones. – Cindy Jun 06 '18 at 15:25
  • That might be an issue with your data or your model. Please post a sample of your data for folks to work with – camille Jun 06 '18 at 15:46
  • @camille, I added the actual dataset and the formula for the calculated Slot2 column. I also added the class distribution for the LABELS (binary) column. The classes are unbalanced. The dataset has more than 350k rows. When I tried to use the **trainControl** function with **method = "repeatedcv"** to balance it, it did not finish in a reasonable time, due to the huge size of the dataset (as I understand it). Thanks) – Cindy Jun 06 '18 at 16:17
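
The distinction camille draws between class probabilities and class labels can be illustrated with base R alone. The sketch below does not use the data from the question: `prob` is a hypothetical stand-in shaped like what `predict(..., type = "prob")` returns (a numeric matrix with one column per class), and `reference` is a made-up vector of true labels. It shows that such a matrix is not a factor, and one possible way to turn it into a factor with the same levels as the reference.

```r
# Hypothetical true labels (not the data from the question)
reference <- factor(c("0", "1", "1", "0", "1"), levels = c("0", "1"))

# Hypothetical stand-in for the output of predict(rf2, train, type = "prob"):
# a numeric matrix with one column per class, named after the factor levels
prob <- matrix(c(0.9, 0.1,
                 0.2, 0.8,
                 0.4, 0.6,
                 0.7, 0.3,
                 0.1, 0.9),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("0", "1")))

is.factor(prob)  # a matrix is not a factor, hence the complaint about factors

# Take the most probable class in each row and rebuild a factor
# that shares its levels with the reference
pred <- factor(colnames(prob)[max.col(prob, ties.method = "first")],
               levels = levels(reference))

# Base-R cross-tabulation of predicted vs. true classes
table(Prediction = pred, Reference = reference)
```

Setting `levels = levels(reference)` explicitly keeps both factors aligned even when one class never shows up among the predictions.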

1 Answer

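The error in R:

Error: `data` and `reference` should be factors with the same levels.

was fixed by changing the type parameter in the predict function from "prob" to "response". With type="prob", predict returns a matrix of class probabilities rather than a factor of predicted classes, and confusionMatrix expects factors. The corrected code: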

# Prediction & Confusion Matrix - train data
p1 <- predict(rf2, train, type="response")
print("Prediction & Confusion Matrix - train data")
confusionMatrix(p1, train$LABELS)

@Camille, Thank you so much)
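
For readers who want to try the fix end to end, here is a minimal sketch on synthetic data. It assumes the randomForest and caret packages are installed; the data frame `toy`, its two features and the 70/30 split are made up for illustration and are not the dataset from the question. The `table(p_test)` line is a quick way to see whether the predictions cover both classes or collapse into one.

```r
library(randomForest)
library(caret)

set.seed(42)

# Synthetic stand-in data (hypothetical, not the dataset from the question)
n <- 500
toy <- data.frame(Var1 = rnorm(n), Var2 = rnorm(n))
toy$LABELS <- factor(ifelse(toy$Var1 + toy$Var2 + rnorm(n, sd = 0.5) > 0, 1, 0),
                     levels = c(0, 1))

# Simple random 70/30 split into train and test parts
idx   <- sample(n, size = 0.7 * n)
train <- toy[idx, ]
test  <- toy[-idx, ]

rf <- randomForest(LABELS ~ Var1 + Var2, data = train, ntree = 100)

# type = "response" returns a factor of predicted classes
p_test <- predict(rf, newdata = test, type = "response")

# How many predictions fall into each class
table(p_test)

# Both arguments are now factors with the same levels
cm <- confusionMatrix(p_test, test$LABELS)
print(cm)
```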

Cindy