
I have 150 columns of scores against one column of labels (1/0). My goal is to compute 150 AUC scores.

Here is a manual example:

auc(roc(df$label, df$col1)),
auc(roc(df$label, df$col2)),

...

I can use Map/sapply/lapply here, but is there any other method or function?
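For reference, the sapply version I mean would look roughly like this (assuming the roc/auc functions come from pROC and that every column of df except label holds scores):

library(pROC)

# loop over the score columns; as.numeric() drops the "auc" class so the result is a plain numeric vector
auc_scores <- sapply(df[, setdiff(names(df), "label")],
                     function(x) as.numeric(auc(roc(df$label, x))))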

steves
  • What's wrong with `apply` family functions? The `purrr` package does similar things but much more intuitively imo. But again, it's similar to the `apply` family of functions. What's your end goal, a dataframe with all the scores? – Amar Apr 16 '18 at 06:06
  • In `dplyr`, you could do something like: 1. `gather()` all the `col1, col2, ...` scores 2. `group_by()` the score number 3. `summarise()` to get the AUC. – Marius Apr 16 '18 at 06:24
  • @Amar yes, sure, I will tell you why I am asking: I have 600000 rows in my 151-column dataframe. I am running sapply now, but I thought that maybe some experts here would suggest a better way, like the Dask library in Python that parallelizes the calculations. – steves Apr 16 '18 at 06:24
  • @Marius I am aware of this method, but would it be faster or about the same as running sapply? – steves Apr 16 '18 at 06:25
  • I really couldn't say without testing. I would guess about the same; I think most of the runtime will be in `roc()`/`auc()` and not affected by the method you use to loop/iterate, but it's possible there will be some differences. – Marius Apr 16 '18 at 06:28
  • To simplify: I ran it yesterday on one pair and it took something like 5 minutes, with 600000 rows in each column. So 150 * 5 minutes = 12.5 hours. – steves Apr 16 '18 at 06:31
  • I see that only one CPU is actually working; I want all 16 to work on the calculations. How can I tell R to use them all? @Marius – steves Apr 16 '18 at 06:44
  • I assume you're using `library(pROC)` to calculate your AUCs? – Calimo Apr 16 '18 at 20:49
  • @Calimo sure, please share your experience if you are familiar with better techniques/libraries. – steves Apr 18 '18 at 05:19
  • @steves you may want to edit your question given that the accepted answer doesn't actually answer the question you asked, but rather the one I guessed you were probably after. – Calimo Apr 18 '18 at 05:39
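A minimal sketch of the long-format approach Marius describes in the comments above, assuming the roc/auc functions from pROC together with tidyr/dplyr (label, col1, col2, ... are the column names from the question):

library(dplyr)
library(tidyr)
library(pROC)

df %>%
  gather(score_col, score, -label) %>%                  # one row per (score column, observation)
  group_by(score_col) %>%                               # one group per original score column
  summarise(AUC = as.numeric(auc(roc(label, score))))   # one AUC per column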

3 Answers


This is a bit of an XY problem. What you actually want to achieve is to speed up your calculation. gfgm's answer addresses that with parallelization, but that's only one way to go.

If, as I assume, you are using library(pROC)'s roc/auc functions, you can gain even more speed by selecting the appropriate algorithm for your dataset.

pROC comes with essentially two algorithms that scale very differently depending on the characteristics of your data set. You can benchmark which one is faster on your data by passing algorithm=0 to roc:

# generate some toy data
label <- rbinom(600000, 1, 0.5)
score <- rpois(600000, 10)

library(pROC)
roc(label, score, algorithm=0)
Starting benchmark of algorithms 2 and 3, 10 iterations...
  expr        min         lq       mean     median        uq      max neval
2    2 4805.58762 5827.75410 5910.40251 6036.52975 6085.8416 6620.733    10
3    3   98.46237   99.05378   99.52434   99.12077  100.0773  101.363    10
Selecting algorithm 3.

Here we select algorithm 3, which shines when the number of thresholds remains low. But if 600000 data points take 5 minutes to compute, I strongly suspect that your data is very continuous (no measurements with identical values) and that you have about as many thresholds as data points (600000). In that case you can skip directly to algorithm 2, which scales much better as the number of thresholds in the ROC curve increases.

You can then run:

auc(roc(df$label, df$col1, algorithm=2)),
auc(roc(df$label, df$col2, algorithm=2)),

On my machine each call to roc now takes about 5 seconds, pretty independently of the number of thresholds. This way you should be done in under 15 minutes total. Unless you have 50 cores or more this is going to be faster than just parallelizing. But of course you can do both...
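To apply this to all 150 columns at once, here is a sketch that simply plugs algorithm = 2 into the sapply loop from the question (df with a label column and 150 score columns, as in the question):

library(pROC)

# one AUC per score column, forcing the algorithm that scales well with many thresholds
auc_scores <- sapply(df[, setdiff(names(df), "label")],
                     function(x) as.numeric(auc(roc(df$label, x, algorithm = 2))))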

Calimo

If you want to parallelize the computations you could do it like this:

# generate some toy data
label <- rbinom(1000, 1, .5)
scores <- matrix(runif(1000*150), ncol = 150)
df <- data.frame(label, scores)

library(pROC)
library(parallel)

auc(roc(df$label, df$X1))
#> Area under the curve: 0.5103

# mclapply iterates over the score columns; mc.cores raises the default of 2 workers
auc_res <- mclapply(df[, 2:ncol(df)],
                    function(col) auc(roc(df$label, col)),
                    mc.cores = detectCores())
head(auc_res)
#> $X1
#> Area under the curve: 0.5103
#> 
#> $X2
#> Area under the curve: 0.5235
#> 
#> $X3
#> Area under the curve: 0.5181
#> 
#> $X4
#> Area under the curve: 0.5119
#> 
#> $X5
#> Area under the curve: 0.5083
#> 
#> $X6
#> Area under the curve: 0.5159

Since most of the computational time seems to be spent in the call to auc(roc(...)), this should speed things up if you have a multi-core machine.

gfgm
  • I did it exactly like you but used `parSapply`. Why did you use `mclapply`? @gfgm – steves Apr 16 '18 at 08:05
  • tbh because I remember the syntax better and it saves me a line of code loading the environment on the clusters. But actually I should have given an example with `parSapply()`/`parLapply()` because `mclapply` is not Windows compatible (which I didn't think of) @steves – gfgm Apr 16 '18 at 08:39
  • discussion of the differences here: https://stackoverflow.com/questions/17196261/understanding-the-differences-between-mclapply-and-parlapply-in-r – gfgm Apr 16 '18 at 08:40
  • Thanks a lot. The issue is that if I use parSapply it takes less memory, while mclapply uses more and sometimes crashes. @gfgm – steves Apr 16 '18 at 16:04
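For completeness, a minimal parSapply variant of the same idea as discussed in these comments (it also works on Windows; the cluster size and the export of df are assumptions, not part of the original answer):

library(pROC)
library(parallel)

cl <- makeCluster(detectCores())   # one worker per core
clusterEvalQ(cl, library(pROC))    # load pROC on every worker
clusterExport(cl, "df")            # copy the data to every worker

auc_res <- parSapply(cl, df[, 2:ncol(df)],
                     function(col) as.numeric(auc(roc(df$label, col))))
stopCluster(cl)

Note that clusterExport ships a full copy of df to each worker, so memory use grows with the cluster size.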

There's a function for doing that in the cutpointr package. It also calculates cutpoints and other metrics, but you can discard those. By default it will try all columns except the response column as predictors. Additionally, the direction of the ROC curve (whether larger values imply the positive class or the other way around) is determined automatically if you leave out direction, or you can set it manually.

dat <- iris[1:100, ]
library(tidyverse)
library(cutpointr)
mc <- multi_cutpointr(data = dat, class = "Species", pos_class = "versicolor", 
                silent = FALSE)
mc %>% select(variable, direction, AUC)

# A tibble: 4 x 3
  variable     direction   AUC
  <chr>        <chr>     <dbl>
1 Sepal.Length >=        0.933
2 Sepal.Width  <=        0.925
3 Petal.Length >=        1.00 
4 Petal.Width  >=        1.00  

By the way, the runtime shouldn't be a problem here, because calculating the ROC curve (even including a cutpoint) takes less than a second for one variable and one million observations using cutpointr or ROCR, so your task should run in about one or two minutes.

If memory is the limiting factor, parallelization will probably make that problem worse. If the above solution takes up too much memory because it returns ROC curves for all variables before dropping those columns, you can try selecting the columns of interest right away in a call to map:

# 600,000 observations for 150 variables and a binary outcome

predictors <- matrix(data = rnorm(150 * 6e5), ncol = 150)
dat <- as.data.frame(cbind(y = sample(0:1, size = 6e5, replace = T), predictors))

library(cutpointr)
library(tidyverse)

vars <- colnames(dat)[colnames(dat) != "y"]
result <- map_df(vars, function(coln) {
    cutpointr_(dat, x = coln, class = "y", silent = TRUE, pos_class = 1) %>%
        select(direction, AUC) %>%
        mutate(variable = coln)
})

result

# A tibble: 150 x 3
   direction   AUC variable
   <chr>     <dbl> <chr>   
 1 >=        0.500 V2      
 2 <=        0.501 V3      
 3 >=        0.501 V4      
 4 >=        0.501 V5      
 5 <=        0.501 V6      
 6 <=        0.500 V7      
 7 <=        0.500 V8      
 8 >=        0.502 V9      
 9 >=        0.501 V10     
10 <=        0.500 V11     
# ... with 140 more rows 
thie1e