
When using the importance() function from R's randomForest package, you can get a list of the most important predictors.

I was wondering how to tell which of the two binary outcomes each predictor is associated with (i.e. which predictors are associated with disease outcomes and which predictors are associated with disease-free outcomes).

Here is my code to get the list of important predictors:

# Load the packages used below (arrange() and desc() come from dplyr)
library(randomForest)
library(dplyr)

# Make a data frame with predictor names and their importance
imp_RF_model <- importance(RF_model)
imp_RF_model <- data.frame(predictors = rownames(imp_RF_model), imp_RF_model)

# Order the predictor levels by importance
imp_sort_RF_model <- arrange(imp_RF_model, desc(MeanDecreaseGini))
imp_sort_RF_model$predictors <- factor(imp_sort_RF_model$predictors, levels = imp_sort_RF_model$predictors)

# Select the top 20 predictors (head() avoids NA rows if there are fewer than 20)
imp_20_RF_model <- head(imp_sort_RF_model, 20)

For example, if protein A is a strong predictor, I want to know if high levels of protein A are associated with the disease, or if high levels of protein A are associated with disease-free samples. So I want to know if the predictor is inversely associated with the disease or directly associated with the disease.

Alicia
  • [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. We can't run your code without your data, and we can't see any of the output you're referencing. If this is about the next steps to take in your analysis, it's probably better suited for [stats.se] – camille Dec 05 '19 at 15:46
  • I don't really even understand your question. If a predictor is important for predicting Y=1 then isn't it equally important for predicting Y=0? – Dason Dec 05 '19 at 16:33
  • @Dason what I mean is: if protein A is a strong predictor, I want to know if high levels of protein A are associated with the disease, or if high levels of protein A are associated with disease-free samples. So I want to know if the predictor is inversely associated with the disease or directly associated with the disease. – Alicia Dec 06 '19 at 10:25
  • I am guessing that, to find out the latter, a simple tabulation of the mean of protein A for the disease and disease-free samples will suffice: if the mean is higher in the disease samples, then very likely protein A contributes positively. In random forests it can get more complicated than that, but that should give you an indication at least. – Valeri Voev Dec 06 '19 at 12:10
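
The tabulation suggested in the last comment could be sketched roughly as follows. This is only an illustration on simulated data: the data frame `dat`, the `outcome` column, its level names, and the `protein_A` / `protein_B` columns are all hypothetical stand-ins, since the original data were not posted. It uses base R only and compares class means, so it shows a marginal direction rather than anything the forest itself has learned.

```r
# Simulated stand-in data; all column and level names here are assumptions.
set.seed(42)
n <- 200
outcome <- factor(rep(c("disease_free", "disease"), each = n / 2),
                  levels = c("disease_free", "disease"))
dat <- data.frame(
  outcome   = outcome,
  protein_A = rnorm(n, mean = ifelse(outcome == "disease", 2, 0)),   # simulated higher in disease
  protein_B = rnorm(n, mean = ifelse(outcome == "disease", -2, 0))   # simulated lower in disease
)

# Mean of each predictor within each outcome class
class_means <- aggregate(. ~ outcome, data = dat, FUN = mean)

# Difference in means (disease minus disease-free) as a rough direction
predictors <- setdiff(names(dat), "outcome")
diff_means <- unlist(class_means[class_means$outcome == "disease", predictors]) -
  unlist(class_means[class_means$outcome == "disease_free", predictors])
direction <- ifelse(diff_means > 0, "higher in disease", "lower in disease")

print(class_means)
print(data.frame(predictor = predictors, diff = diff_means, direction = direction))
```

With real data, `predictors` would presumably be restricted to the top-ranked names from the importance table. Because a random forest can capture non-monotonic effects and interactions, a difference in means is only a first indication; the randomForest package's partialPlot() function is one way to look at the shape of a predictor's effect on a given class.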
