
I'm interested in creating a model that optimizes PPV. I've created an RF model (below) that outputs a confusion matrix, from which I then manually calculate sensitivity, specificity, PPV, NPV, and F1. I know that accuracy is what is optimized right now, but I'm willing to give up some sensitivity and specificity to get a much higher PPV.

library(caret)  # provides train(), trainControl(), twoClassSummary()

data_ctrl_null <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                               summaryFunction = twoClassSummary,
                               savePredictions = TRUE, sampling = NULL)

set.seed(5368)

model_htn_df <- train(outcome ~ ., data = htn_df, method = "rf", ntree = 1000,
                      tuneGrid = data.frame(mtry = 38), trControl = data_ctrl_null,
                      preProc = c("center", "scale"), metric = "ROC", importance = TRUE)

model_htn_df$finalModel #provides confusion matrix

Results:

Call:
  randomForest(x = x, y = y, ntree = 1000, mtry = param$mtry, importance = TRUE) 
           Type of random forest: classification
                 Number of trees: 1000
  No. of variables tried at each split: 38

    OOB estimate of  error rate: 16.2%
    Confusion matrix:
      no yes class.error
 no  274  19  0.06484642
 yes  45  57  0.44117647

My manual calculation: sensitivity = 55.9%, specificity = 93.5%, PPV = 75.0%, NPV = 85.9%. (The confusion matrix lists "no" before "yes", so I swap the numbers accordingly when I calculate the performance metrics.)
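
The hand calculation above can be reproduced in a few lines of base R. This is only a minimal sketch: it assumes the rows of the printed matrix are the observed classes, the columns are the predicted classes, and "yes" is the positive class. The object names (`cm`, `tp`, `metrics`, and so on) are illustrative and not part of the original code.

```r
# Confusion matrix copied from the randomForest printout above.
# Assumption: rows = observed class, columns = predicted class.
cm <- matrix(c(274, 19,
                45, 57),
             nrow = 2, byrow = TRUE,
             dimnames = list(observed = c("no", "yes"),
                             predicted = c("no", "yes")))

# Assumption: "yes" is the positive class.
tp <- cm["yes", "yes"]  # observed yes, predicted yes
fn <- cm["yes", "no"]   # observed yes, predicted no
fp <- cm["no", "yes"]   # observed no, predicted yes
tn <- cm["no", "no"]    # observed no, predicted no

metrics <- c(
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),
  npv         = tn / (tn + fn),
  f1          = 2 * tp / (2 * tp + fp + fn)
)
round(metrics, 3)
```

Naming the positive class explicitly avoids having to swap cells by hand when the matrix lists "no" first.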

So what do I need to do to get a PPV = 90%?

This is a similar question, but I'm not really following it.

PleaseHelp
  • check these out: https://stackoverflow.com/questions/31688073/calculate-ppv-and-npv-during-model-training-with-caret?rq=1, https://stackoverflow.com/questions/52691761/additional-metrics-in-caret-ppv-sensitivity-specificity – missuse Mar 26 '20 at 19:01

1 Answer


We define a function to calculate PPV and return the results with a name:

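# Custom caret summary function: `data` holds the observed (obs) and
# predicted (pred) classes of one resample; `lev` holds the class levels.
PPV <- function(data, lev = NULL, model = NULL) {
  # lev[1], the first factor level, is treated as the positive class
  value <- posPredValue(data$pred, data$obs, positive = lev[1])
  c(PPV = value)
}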

Let's say we have the following data:

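library(randomForest)
library(caret)
data <- iris
# Collapse the three species into a two-class outcome
data$Species <- ifelse(data$Species == "versicolor", "versi", "others")
# Random sample of 100 row indices for training (no seed set, so results vary)
trn <- sample(nrow(iris), 100)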

Then we train by specifying PPV to be the metric:

mdl <- train(Species ~ ., data = data[trn,],
             method = "rf",
             metric = "PPV",
             trControl = trainControl(summaryFunction = PPV, 
                                      classProbs = TRUE))

Random Forest 

100 samples
  4 predictor
  2 classes: 'others', 'versi' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
Resampling results across tuning parameters:

  mtry  PPV      
  2     0.9682811
  3     0.9681759
  4     0.9648426

PPV was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

Now you can see that the model is tuned on PPV. However, you cannot force training to reach a PPV of 0.9. That really depends on the data: if your independent variables have no predictive power, the PPV will not improve no matter how much you train.
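
Choosing PPV as the tuning metric does not change the probability cutoff used to call a case positive. Moving that cutoff after fitting is the precision-recall idea raised in the comments. Below is a minimal base R sketch of it, not part of the original answer: `ppv_by_threshold` and `pick_threshold` are hypothetical helper names, and `prob` and `obs` are made-up positive-class probabilities and observed labels used only for illustration. With a caret model, such probabilities would presumably come from held-out predictions, which is an assumption about your setup.

```r
# Hypothetical helper: PPV, sensitivity and specificity at each cutoff.
ppv_by_threshold <- function(prob, obs, positive,
                             thresholds = seq(0.05, 0.95, by = 0.05)) {
  is_pos <- obs == positive
  rows <- lapply(thresholds, function(t) {
    called_pos <- prob >= t
    tp <- sum(called_pos & is_pos)
    fp <- sum(called_pos & !is_pos)
    fn <- sum(!called_pos & is_pos)
    tn <- sum(!called_pos & !is_pos)
    data.frame(threshold   = t,
               # PPV is undefined when nothing is called positive
               ppv         = if (tp + fp > 0) tp / (tp + fp) else NA_real_,
               sensitivity = tp / (tp + fn),
               specificity = tn / (tn + fp))
  })
  do.call(rbind, rows)
}

# Hypothetical helper: lowest cutoff whose PPV reaches the target,
# or NULL when no cutoff on the grid reaches it.
pick_threshold <- function(curve, target_ppv) {
  ok <- curve[!is.na(curve$ppv) & curve$ppv >= target_ppv, ]
  if (nrow(ok) == 0) return(NULL)
  ok[which.min(ok$threshold), ]
}

# Made-up scores and labels, for illustration only.
prob <- c(0.97, 0.91, 0.83, 0.78, 0.66, 0.61, 0.52, 0.43, 0.31, 0.18)
obs  <- c("yes", "yes", "yes", "no", "yes", "no", "yes", "no", "no", "no")

curve <- ppv_by_threshold(prob, obs, positive = "yes")
best  <- pick_threshold(curve, target_ppv = 0.9)
best
```

On this toy data the selected cutoff is 0.80, where PPV is 1 and sensitivity drops to 0.6, which illustrates the trade-off. A `NULL` result means the target PPV is not reachable with these scores. A cutoff chosen and evaluated on the same predictions will look optimistic, so it is safer to check it on data that was not used to pick it.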

StupidWolf
  • Thanks, this is definitely helpful for outputting the performance metrics which I didn't know how to do. But I'm still more interested in how to increase PPV thru training. Any thoughts? – PleaseHelp Mar 27 '20 at 14:41
  • Your question is not very clear. you set the metric to be PPV. I update my answer now... and it really boils down to whether that can be tuned with your data – StupidWolf Mar 27 '20 at 15:12
  • Sorry I'm confusing, I'm new to this :( I guess what I want is how to set the metric to be PPV. In your above example, PPV = 0.93. Can you show me how you can train a model and the PPV will always equal 97% for example? – PleaseHelp Mar 27 '20 at 16:30
  • @PleaseHelp, see answer above. It will allow you to set the metric – StupidWolf Mar 27 '20 at 16:47
  • @StupidWolf thanks so much for your response, that makes sense! I was told that when you train a model, you can generate a precision-recall curve, find where your ppv you want (like 0.9) falls on that curve, and then extract the other performance metrics like sens, spec, etc. That's what I'm looking for. Do I need to ask a separate question for how to do that? – PleaseHelp Mar 27 '20 at 18:33
  • no problem. better to ask a separate question and provide a reproducible example at that.. something like what i provide in the answer, where people can try your code – StupidWolf Mar 27 '20 at 18:34
  • @PleaseHelp, if this answer was useful to you or solved your problem, you can also consider accepting the answer. https://stackoverflow.com/help/someone-answers. Just letting you know since you are new to SO. – StupidWolf Mar 27 '20 at 18:49