0

I have many large random forest classification models (~60 min run time each) that I use to predict onto a raster with the type="prob" option. I am happy with the raster output (a raster stack with x layers, one probability layer per class). However, I would like a simple way to convert these probabilities to a single-layer classification (i.e. winners only, no probabilities). This would be the equivalent of type="response".

Here is a simple example (which is not a raster, but still applies):

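library(randomForest)
data(iris)
set.seed(111)
# Random ~80/20 split; only the rows with ind == 1 are used to fit the model
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data = iris[ind == 1, ])
# With no newdata, predict() returns out-of-bag predictions for the training rows
iris.prob <- predict(iris.rf, type = "prob")      # matrix of class probabilities
iris.resp <- predict(iris.rf, type = "response")  # factor of predicted classes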

What is the most efficient way to use the iris.prob object to get the equivalent of iris.resp without rerunning the random forests (which, in my case with many large rasters, would take many hours)?

Thanks in advance

treetopdewdrop
  • Once you have run `iris.rf`, the `predict` functions do not require you to re-run the model. Once the model is run, predictions should be much faster since they are only using the outputs from the model in calculating either the probability or response. Are you trying to find what the most efficient way to run the prediction is? Or are you trying to figure out how to get the same values out of `type = "prob"` that you are getting out of `type = "response"`? – Geochem B Jul 19 '17 at 21:39
  • Yes, I agree. But I don't have access to the model (iris.rf) - Only the output probabilities (iris.prob). Need a simple way to convert probabilities object to a single classified object – treetopdewdrop Jul 19 '17 at 21:43
  • Ok, so someone already ran the model as well as `iris.prob`, and you are trying to replicate the `iris.resp` without running the model? I get that it would take many hours to run the model, and I am just trying to figure out the problem – Geochem B Jul 19 '17 at 21:47
  • Exactly. Thanks! – treetopdewdrop Jul 19 '17 at 21:49

2 Answers

1

If you want the class with the highest probability in each row of a matrix shaped like iris.prob, find the column that holds each row's maximum and return its column name.

colnames(iris.prob)[max.col(iris.prob,ties.method="first")]

I took this exact usage from another thread, so if it doesn't work for you, try one of the other answers.
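
As a minimal sketch of how that one-liner behaves, here it is on a small made-up probability matrix standing in for iris.prob (the values and the names `prob`, `winner_idx` and `winner` are illustrative only). Note that `max.col` defaults to `ties.method = "random"`, so "first" is set explicitly to keep the result reproducible.

```r
# Toy probability matrix standing in for iris.prob (values are made up).
prob <- matrix(
  c(0.90, 0.08, 0.02,
    0.10, 0.70, 0.20,
    0.05, 0.15, 0.80,
    0.40, 0.40, 0.20),   # exact tie between the first two classes
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Column index of each row's maximum; ties resolve to the first column.
winner_idx <- max.col(prob, ties.method = "first")

# Map the indices back to class labels, keeping the full set of levels.
winner <- factor(colnames(prob)[winner_idx], levels = colnames(prob))
```

Wrapping the labels in a factor with all column names as levels keeps classes that never win in the level set, which is closer to what type="response" returns.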

Geochem B
1

iris.prob contains the classification result as probabilities: for each observation, the probability of it being assigned to each class. So you just need to extract the column name of the maximum value in each row.

E.g.: iris.resp2 = colnames(iris.prob)[apply(iris.prob, 1, which.max)]

iris.resp2 == as.character(iris.resp) should then return TRUE for every element (exact ties aside, since which.max always picks the first maximum).
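
One practical caveat, sketched here under assumptions: if some rows are entirely NA (for example masked raster cells), `which.max` returns `integer(0)` for them, so `apply()` hands back a list rather than a plain index vector. The helper `prob_to_class` below is hypothetical, not part of any package; it skips incomplete rows and fills them with NA.

```r
# Hypothetical helper: class labels from a probability matrix, NA-safe.
prob_to_class <- function(prob) {
  out <- rep(NA_character_, nrow(prob))
  ok <- complete.cases(prob)            # rows with no missing values
  if (any(ok)) {
    idx <- max.col(prob[ok, , drop = FALSE], ties.method = "first")
    out[ok] <- colnames(prob)[idx]
  }
  factor(out, levels = colnames(prob))
}

# Made-up example data; the second row mimics a masked cell.
prob <- rbind(
  c(0.7, 0.2, 0.1),
  c(NA,  NA,  NA),
  c(0.1, 0.3, 0.6)
)
colnames(prob) <- c("a", "b", "c")

classes <- prob_to_class(prob)

# On the complete rows this agrees with the apply()/which.max() approach.
ok <- complete.cases(prob)
by_apply <- colnames(prob)[apply(prob[ok, , drop = FALSE], 1, which.max)]
```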

  • Thanks, that is the general idea. However this method is extremely slow on large raster stacks. Trying to maximize efficiency. – treetopdewdrop Jul 19 '17 at 22:09
  • Hmm... so I'm not sure I can help, it's already quite optimized as it's only about using built-in functions in a vectorized way on an already computed matrix and a vector! I mean for a 723 Mb matrix with 1M lines and 100 columns it takes less than 5 seconds on my computer. I hope you'll find the answer ;)! Good luck –  Jul 19 '17 at 22:32
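
For the large-raster case raised in these comments, one possible sketch is to treat the stack as a (cells x classes) matrix and call `max.col` once. This does the row scan in compiled code instead of calling `which.max` once per cell through `apply()`. The 3-D array below is only a toy stand-in for a raster stack, and all names are made up. How the cell values are extracted from a real raster object, and in what cell order, depends on the package used and is not shown here.

```r
# Toy stand-in for a raster stack: 2 rows x 3 columns x 3 class layers.
set.seed(1)
nr <- 2; nc <- 3; k <- 3
stack_arr <- array(runif(nr * nc * k), dim = c(nr, nc, k))

# Flatten to a (cells x classes) matrix: one row per cell, one column per layer.
cell_prob <- matrix(stack_arr, nrow = nr * nc, ncol = k)

# A single vectorised call finds the winning layer for every cell.
winner_idx <- max.col(cell_prob, ties.method = "first")

# Reshape back to the original row/column layout.
class_map <- matrix(winner_idx, nrow = nr, ncol = nc)
```

For stacks too large to hold in memory, the same call could be applied to one block of cells at a time, since each cell's winner depends only on its own row.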