There's a function for doing that in the cutpointr
package. It also calculates cutpoints and other metrics, but you can discard those. By default it will try all columns except for the response column as predictors. Additionally, you can select whether the direction of the ROC curve (whether larger values imply the positive class or the other way around) is determined automatically by leaving out direction
or set it manually.
dat <- iris[1:100, ]
library(tidyverse)
library(cutpointr)
mc <- multi_cutpointr(data = dat, class = "Species", pos_class = "versicolor",
silent = FALSE)
mc %>% select(variable, direction, AUC)
# A tibble: 4 x 3
variable direction AUC
<chr> <chr> <dbl>
1 Sepal.Length >= 0.933
2 Sepal.Width <= 0.925
3 Petal.Length >= 1.00
4 Petal.Width >= 1.00
By the way, the runtime shouldn't be a problem here because calculating the ROC-curve (even including a cutpoint) takes less than a second for one variable and one million observations using cutpointr
or ROCR
, so your task runs in about one or two minutes.
If memory is the limiting factor, parallelization will probably make that problem worse. If the above solution takes up too much memory, because it returns ROC-curves for all variables before dropping those columns, you can try selecting the columns of interest right away in a call to map
:
# 600.000 observations for 150 variables and a binary outcome
predictors <- matrix(data = rnorm(150 * 6e5), ncol = 150)
dat <- as.data.frame(cbind(y = sample(0:1, size = 6e5, replace = T), predictors))
library(cutpointr)
library(tidyverse)
vars <- colnames(dat)[colnames(dat) != "y"]
result <- map_df(vars, function(coln) {
cutpointr_(dat, x = coln, class = "y", silent = TRUE, pos_class = 1) %>%
select(direction, AUC) %>%
mutate(variable = coln)
})
result
# A tibble: 150 x 3
direction AUC variable
<chr> <dbl> <chr>
1 >= 0.500 V2
2 <= 0.501 V3
3 >= 0.501 V4
4 >= 0.501 V5
5 <= 0.501 V6
6 <= 0.500 V7
7 <= 0.500 V8
8 >= 0.502 V9
9 >= 0.501 V10
10 <= 0.500 V11
# ... with 140 more rows