1

I'm working on a classification-type problem using some simulations, and I want to count the number of true positives, false positives, etc. using different thresholds. For example, consider the following example:

library(tidyverse)

set.seed(23)
n <- 100
df <- tibble(
  class = sample(LETTERS[1:5], 100, replace = TRUE),
  pred_class = sample(LETTERS[1:5], 100, replace = TRUE),
  correct = class == pred_class,
  pval = runif(100, 0, 1)
) %>% 
  print()
#> # A tibble: 100 x 4
#>    class pred_class correct   pval
#>    <chr> <chr>      <lgl>    <dbl>
#>  1 C     E          FALSE   0.643 
#>  2 B     C          FALSE   0.561 
#>  3 B     C          FALSE   0.824 
#>  4 D     A          FALSE   0.971 
#>  5 E     A          FALSE   0.0283
#>  6 C     D          FALSE   0.723 
#>  7 E     D          FALSE   0.521 
#>  8 E     D          FALSE   0.619 
#>  9 E     E          TRUE    0.198 
#> 10 E     B          FALSE   0.386 
#> # ... with 90 more rows

For a fixed cut-off, the task is trivial (please ignore the direction of the assignments, they are correct for the actual task I'm working on, but I do recognize they might appear backwards here). This is what I'm trying to accomplish, but with more than 1 cutoff:

df %>%
  summarize(
    cutoff = 0.05,
    TP = sum(!correct & pval < 0.05),
    FP = sum(correct & pval < 0.05),
    FN = sum(!correct & pval >= 0.05),
    TN = sum(correct & pval >= 0.05)
  )
#> # A tibble: 1 x 5
#>   cutoff    TP    FP    FN    TN
#>    <dbl> <int> <int> <int> <int>
#> 1   0.05     5     1    73    21

But for multiple cutoffs, say a <- c(0.01, 0.05, 0.1) or a <- seq(0, .15, 0.01), this is a lot of cut and paste. So my goal is to figure out how to do this with functionals and (I think?) summarize_at. Unfortunately, this is giving me issues.

I can get this to work when the sums are based on single variables. It's ugly, but the following works:

# define the functionals (note only 2 since we are only looking at 1 variable)

a <- c(0.01, 0.05, 0.1)
pfun <- list(
  less_p = function(a) {function(p) sum(p < a)},
  more_p = function(a) {function(p) sum(p >= a)}
) %>%
  imap(~list(f = .x, label = .y))

fun_list <- cross(list(alpha = alpha, f = pfun)) %>% map(function(x) {
  list(
    f = x$f$f(x$alpha),
    label = paste(x$f$label, x$alpha, sep = "_")
  )
}) %>%
  set_names(., map_chr(., ~ .x$label)) %>%
  map(~ .x$f)

df %>%
  summarize_at(
    .vars = vars(pval),
    .funs = funs(!!!fun_list)
  )
#> # A tibble: 1 x 10
#>   less_p_0.01 less_p_0.02 less_p_0.03 less_p_0.04 less_p_0.05 more_p_0.01
#>         <int>       <int>       <int>       <int>       <int>       <int>
#> 1           1           3           4           4           6          99
#> # ... with 4 more variables: more_p_0.02 <int>, more_p_0.03 <int>,
#> #   more_p_0.04 <int>, more_p_0.05 <int>

Some gather, separate and spread fun, and this will be in the desired format.

However, when we write the functionals using the variable correct too, it breaks because correct isn't found:

afun <- list(
  TP_fun = function(a) { function(p) sum(!correct & p <  a)},
  FP_fun = function(a) { function(p) sum( correct & p <  a)},
  FN_fun = function(a) { function(p) sum(!correct & p >= a)},
  TN_fun = function(a) { function(p) sum( correct & p >= a)}
) %>%
  imap(~list(f = .x, label = .y))

# all combinations of alpha and the functions
fun_list <- cross(list(alpha = alpha, f = afun)) %>% map(function(x) {
  list(
    f = x$f$f(x$alpha),
    label = paste(x$f$label, x$alpha, sep = "_")
  )
}) %>%
  set_names(., map_chr(., ~ .x$label)) %>%
  map(~ .x$f)


df %>%
  summarize_at(
    .vars = vars(pval),
    .funs = funs(!!!fun_list)
  )
#> Error in summarise_impl(.data, dots): Evaluation error: object 'correct' not found.

I tried replacing correct in the functionals with .$correct, but this does not solve the problem. What is the best way to reference additional variables from within the functional?

As an aside - I feel like there should be a simpler solution to this problem. If I'm over-complicating a simple problem, please feel free to

Created on 2019-01-30 by the reprex package (v0.2.1)

Melissa Key
  • 4,476
  • 12
  • 21
  • 1
    I asked a similar question a couple months ago and got really good answers that might help: https://stackoverflow.com/q/53288100/5325862 – camille Jan 31 '19 at 03:43
  • Thank you! that got me what I needed. end result is `map_df( fun_list, function(f, data) { f(data$pval, data$correct) }, data = df )` – Melissa Key Jan 31 '19 at 04:08

0 Answers0