
I have two datasets, og.data and newdata.df, with matching features. I want to use the class label in og.data to train a model that identifies cases of this class in newdata.df. I am using the randomForest package in R; its documentation is here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf

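library(caTools)  # provides sample.split()
library(randomForest)
split <- sample.split(og.data$class_label, SplitRatio = 0.7)

# Subset the whole data frame, not only the label column
training_set <- subset(og.data, split == TRUE)
test_set <- subset(og.data, split == FALSE)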

# x drops only the first column; y is the class label
rf.classifier.object <- randomForest(x = training_set[-1],
                                     y = training_set$class_label,
                                     ntree = 500)

I then use the test set to calculate the AUC and to visualize the ROC curve, precision, recall, etc. I do that using prediction probabilities generated like so:

predictions.df <- as.data.frame(predict(rf.classifier.object,
                                            test_set,
                                            type = "prob")
                                    )
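For reference, the AUC step can be sketched in base R from the predicted probabilities alone. This is only an illustration under stated assumptions: `labels` and `scores` are hypothetical stand-ins for the true test labels and for the positive-class column of predictions.df, and `auc_from_scores` is a made-up helper name, not part of randomForest.

```r
# Hypothetical stand-ins for the true test labels and the predicted
# probability of the positive class (one column of predictions.df).
labels <- c(1, 0, 1, 1, 0, 0)
scores <- c(0.9, 0.2, 0.7, 0.4, 0.5, 0.1)

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case is scored above a randomly chosen negative one.
auc_from_scores <- function(labels, scores) {
  r <- rank(scores)                 # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_from_scores(labels, scores)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so `auc` comes out as 8/9.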

All is good, so I proceed to use the trained classifier on new data, and here I run into a problem: the new data does not contain the class label feature. This is annoying, as the entire purpose of training the classifier is to label this new data.

predictions.df <- as.data.frame(predict(rf.classifier.object,
                                            newdata.df,
                                            type = "prob")
                                    )

Please note the error has different variable names simply because I changed the code to make it more general for readability.

Error in predict.randomForest(rf.classifier.object, newdata.df,  : 
  variables in the training data missing in newdata

As per this Stack post, predict.randomForest(), called here as predict(), uses the row names of the feature importance matrix to make its predictions. When I searched the feature names, I found that it is in fact the class label that prevents me from predicting, as I show below.

# > rownames(rf.classifier.object$importance)[!(rownames(rf.classifier.object$importance) %in% colnames(newdata.df))]
# [1] "class_label"
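The same check can be written a little more readably with setdiff(). The sketch below uses hypothetical stand-ins: `trained_on` plays the role of rownames(rf.classifier.object$importance), and the column names `feature_a` and `feature_b` are invented for illustration.

```r
# Hypothetical stand-ins: the column names the model was trained on,
# and a new data frame that lacks the label column.
trained_on <- c("feature_a", "feature_b", "class_label")
newdata <- data.frame(feature_a = c(0.1, 0.4), feature_b = c(3, 5))

# Columns the model expects but the new data does not have.
missing_cols <- setdiff(trained_on, colnames(newdata))

# Columns in the new data that the model never saw.
extra_cols <- setdiff(colnames(newdata), trained_on)
```

Checking both directions shows whether the mismatch is only the label or whether other predictors differ between the two datasets as well.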

It is not clear to me what I should change in my script so that the classifier can be used on data other than the test set. I have followed the instructions exactly, and this seems like a bad design choice for the function. The class label should not be used for calculating feature importance at all and should not even be considered a feature, in my opinion.
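One possible cause, offered as a guess rather than a confirmed diagnosis: training_set[-1] drops the first column by position, so if class_label is not the first column of og.data it would stay among the predictors passed as x. A minimal sketch of dropping the label by name instead is below. The toy data frame, its column names, and ntree = 50 are assumptions made for illustration, and the randomForest call is guarded so the snippet still runs where the package is not installed.

```r
set.seed(1)
# Hypothetical toy data standing in for og.data; the label is deliberately
# not the first column, so dropping column 1 would leave it among the predictors.
toy <- data.frame(feature_a = rnorm(20),
                  feature_b = rnorm(20),
                  class_label = factor(rep(c("yes", "no"), 10)))

# Select predictors by name so the label can never end up in x.
predictor_cols <- setdiff(names(toy), "class_label")
x_train <- toy[, predictor_cols, drop = FALSE]
y_train <- toy$class_label

# Guarded so the sketch still runs where randomForest is not installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(x = x_train, y = y_train, ntree = 50)
  # Unlabeled data with the same predictor columns and no class_label.
  unlabeled <- data.frame(feature_a = rnorm(5), feature_b = rnorm(5))
  probs <- predict(rf, unlabeled, type = "prob")
}
```

With the label kept out of x, the importance matrix lists only real predictors, so new data without class_label has every column the model expects.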

Angus Campbell