I'm building a Random Forest classifier and I would like to return both the classification and the associated probabilities. My result variable is either 1 or 0, with 1 being the positive class that I want to track.

library(randomForest)

no_of_trees <- 50
# Fit on the under-sampled (balanced) training data
rf.under <- randomForest(as.factor(result) ~ . ,
                         data=data_balanced_under,
                         importance=TRUE,
                         ntree=no_of_trees)

prediction <- predict(rf.under, df.test)
probability <- predict(rf.under, df.test, type="prob")
submit <- data.frame( predicted = prediction, actual = df.test$result)

I wanted probability to return the probability of the positive class, but I get:

> probability
           0    1
242339  1.00 0.00
3356431 1.00 0.00
138327  1.00 0.00
111327  1.00 0.00
3307151 1.00 0.00
222414  1.00 0.00
1817297 1.00 0.00
3860922 1.00 0.00
1710532 1.00 0.00

What are the numbers on the left? I thought they were row numbers, but then why aren't they indexed 1, 2, 3, ...? I tried probability[,2], which I assume gives the probability of the positive result, but that doesn't work either.

Ideally, I would like to include the probabilities in the submit data frame, but I am currently unable to do so.

Also, the confusion matrix gives me:

confusionMatrix(data = submit$predicted, reference = df.test$result , positive="1")

          Reference
Prediction      0      1
         0 913730    160
         1  50872   8219

Is it possible to switch this around? So that it shows positive class "1" first?

GRS
    Please provide a reproducible example of your data as described [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – tobiasegli_te Nov 14 '17 at 15:50

1 Answer

probability returns the probability for each class (you have two classes, so two columns). It is built this way to allow multiclass classification.

If you want the probability of result == 1, just take the second column of probability.
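
A minimal sketch of pulling out that column and attaching it to submit. The matrix below is a hand-built stand-in with the same shape as the output of predict(rf.under, df.test, type="prob"); its values, the three-row prediction and actual vectors, and the column name prob_1 are made up for illustration. Selecting the column by its name "1" rather than by position avoids depending on the order of the factor levels.

```r
# Stand-in for predict(rf.under, df.test, type = "prob"):
# one row per test observation, one column per class level.
probability <- matrix(
  c(1.00, 0.00,
    0.82, 0.18,
    0.10, 0.90),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("242339", "3356431", "138327"), c("0", "1"))
)

# Stand-ins for the predicted and actual classes of the same three rows.
prediction <- factor(c("0", "0", "1"), levels = c("0", "1"))
actual     <- c(0, 0, 1)

# Probability of the positive class, selected by column name.
prob_pos <- probability[, "1"]

submit <- data.frame(
  predicted = prediction,
  actual    = actual,
  prob_1    = prob_pos,                 # hypothetical column name
  row.names = rownames(probability)     # keep the test-set row names
)
print(submit)
```

With the real objects, the same idea would be submit$prob_1 <- probability[, "1"], assuming submit and probability were both built from df.test in the same row order.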

Since your classes are highly unbalanced (0.8% of ones), your classifier tends to predict 0 almost always, so the probability of result == 1 is close to 0 for most examples. This is why your probabilities don't look like probabilities.
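
The class share can be checked against the confusion matrix from the question. This is a small base R sketch with the counts typed in by hand (rows are predictions, columns are the reference, as in the caret output); the variable names are only illustrative.

```r
# Confusion matrix counts copied from the question.
cm <- matrix(
  c(913730,  160,
     50872, 8219),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Share of test rows whose true class is 1.
prevalence <- sum(cm[, "1"]) / sum(cm)

# Of the rows predicted as 1, the share that really are 1.
precision <- cm["1", "1"] / sum(cm["1", ])

print(round(c(prevalence = prevalence, precision = precision), 4))
```

Both numbers come out small: ones make up less than 1% of the test rows, and most rows predicted as 1 are actually 0.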

Regarding the index of probability: it is rownames(df.test), the index of df.test. I guess you split df.test from df at random, so the index doesn't start at 1.
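
To illustrate where such row names come from, here is a short sketch on a toy data frame (df, the seed, and the sample size are assumptions for the example, not the asker's data). Taking a random subset keeps the row names of the rows it was drawn from, and they can be reset if the original positions are not needed.

```r
# Toy data frame with automatic row names "1" .. "10".
df <- data.frame(x = seq(10, 100, by = 10))

set.seed(42)                              # arbitrary seed for the example
test_idx <- sample(nrow(df), 3)           # random row positions
df.test  <- df[test_idx, , drop = FALSE]

# The subset keeps the row names of the rows it was taken from.
print(rownames(df.test))

# Optional: renumber from 1 if the original positions are not needed.
df.test.reset <- df.test
rownames(df.test.reset) <- NULL
print(rownames(df.test.reset))
```

Keeping the original row names can be useful, since they allow each prediction to be traced back to its row in the full data set.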

Emmanuel-Lin