
I have this great little function summarise_posterior (given below) as part of my package driver (available here).

The function is great and super useful. The one problem is that I have been working with larger and larger data, and it can be very slow. In short, my question is: is there a tidyverse-esque way of speeding this up while still retaining the key flexibility of this function (see the examples in the documentation)?

At least one key speed-up could come from computing all the quantiles in a single call rather than calling the quantile function over and over. The latter approach, which is what is currently implemented, probably re-sorts the same vector over and over.
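For context, quantile() already accepts a vector of probabilities, so all five quantiles can in principle come from one call that sorts the data only once. A minimal sketch:

x <- rnorm(100)
# One call, one sort: returns a named vector of all five quantiles
quantile(x, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))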

#' Shortcut for summarising a variable with quantiles and mean
#'
#' @param data tidy data frame
#' @param var variable name (unquoted) to be summarised
#' @param ... other expressions to pass to summarise
#'
#' @return data.frame
#' @export
#' @details Notation: \code{pX} refers to the \code{X}\% quantile
#' @import dplyr
#' @importFrom stats quantile
#' @importFrom rlang quos enquo
#' @examples
#' d <- data.frame("a"=sample(1:10, 50, TRUE),
#'                 "b"=rnorm(50))
#'
#' # Summarise the posterior for b over the grouping of a and also calculate
#' # the minimum of b (in addition to the normal statistics returned)
#' d <- dplyr::group_by(d, a)
#' summarise_posterior(d, b, mean.b = mean(b), min=min(b))
summarise_posterior <- function(data, var, ...){
  qvar <- enquo(var)
  qs <- quos(...)

  data %>%
    summarise(p2.5 = quantile(!!qvar, probs = 0.025),
              p25 = quantile(!!qvar, probs = 0.25),
              p50 = quantile(!!qvar, probs = 0.5),
              mean = mean(!!qvar),
              p75 = quantile(!!qvar, probs = 0.75),
              p97.5 = quantile(!!qvar, probs = 0.975),
              !!!qs)
}

Rcpp back-end options are also more than welcome.

Thanks!

jds

3 Answers


Here's a solution that makes use of nesting to avoid calling quantile multiple times. Any time you need to store a vector of results inside summarise, simply wrap it in list(). Afterwards, you can unnest these results, pair them up with their names, and use spread to put them in separate columns:

summarise_posterior2 <- function(data, var, ...){
  qvar <- rlang::ensym(var)
  vq <- c(0.025, 0.25, 0.5, 0.75, 0.975)

  # One quantile() call per group, stored as a list-column alongside the
  # corresponding names, then unnested and spread into separate columns
  summarise(data, .qq = list(quantile(!!qvar, vq, names = FALSE)),
            .nms = list(stringr::str_c("p", vq * 100)),
            mean = mean(!!qvar), ...) %>%
    unnest(cols = c(.qq, .nms)) %>%
    spread(.nms, .qq)
}

This doesn't give you nearly the same speed-up as @jay.sf's solution

d <- data.frame("a"=sample(1:10, 5e5, TRUE), "b"=rnorm(5e5))    
microbenchmark::microbenchmark( f1 = summarise_posterior(d, b, mean.b = mean(b), min=min(b)),
                                f2 = summarise_posterior2(d, b, mean.b = mean(b), min=min(b)) )
# Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
#    f1 49.06697 50.81422 60.75100 52.43030 54.17242 200.2961   100
#    f2 29.05209 29.66022 32.32508 30.84492 32.56364 138.9579   100

but it will work correctly with group_by and inside nested functions (whereas substitute-based solutions will break when nested).

r1 <- d %>% dplyr::group_by(a) %>% summarise_posterior(b, mean.b = mean(b), min=min(b))
r2 <- d %>% dplyr::group_by(a) %>% summarise_posterior2(b, mean.b = mean(b), min=min(b))
all_equal( r1, r2 )     # TRUE
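To illustrate that caveat, here is a minimal, hypothetical substitute()-based variant (summarise_sub and wrapper are made-up names for illustration):

summarise_sub <- function(data, x) {
  nm <- deparse(substitute(x))   # recovers the literal caller expression
  data.frame(p50 = quantile(data[[nm]], 0.5))
}
wrapper <- function(data, col) summarise_sub(data, col)

summarise_sub(d, b)   # works: substitute(x) yields `b`
# wrapper(d, b)       # breaks: substitute(x) yields `col`, and d has no column "col"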

If you profile the code, you can see where the major hang-ups are:

Rprof()
for( i in 1:100 )
  d %>% dplyr::group_by(a) %>% summarise_posterior2(b, mean.b = mean(b), min=min(b))
Rprof(NULL)
summaryRprof()$by.self %>% head
#             self.time self.pct total.time total.pct
# ".Call"          1.84    49.73       3.18     85.95
# "sort.int"       0.94    25.41       1.12     30.27
# "eval"           0.08     2.16       3.64     98.38
# "tryCatch"       0.08     2.16       1.44     38.92
# "anyNA"          0.08     2.16       0.08      2.16
# "structure"      0.04     1.08       0.08      2.16

The .Call corresponds mainly to the C++ backend of dplyr, while sort.int is the worker behind quantile(). @jay.sf's solution gains a major speedup by decoupling from dplyr, but it also loses the associated flexibility (e.g., integration with group_by). Ultimately, it's up to you to decide which is more important.

Artem Sokolov

Why not something like this?

summarise_posterior2 <- function(data, x, ...){
  x <- deparse(substitute(x))
  # Pull the selected column out as a matrix so the base functions below
  # can work on it directly, bypassing dplyr entirely
  M <- matrix(unlist(data[, x]), ncol=length(data[, x]))
  # Apply each extra summary function supplied via ... to the data
  qs <- t(sapply(list(...), do.call, list(M)))
  'rownames<-'(cbind(p2.5 = quantile(M, probs=0.025),
        p25 = quantile(M, probs=0.25),
        p50 = quantile(M, probs=0.5),
        mean = mean(M),
        p75 = quantile(M, probs=0.75),
        p97.5 = quantile(M, probs=0.975), qs), NULL
  )
}

> summarise_posterior2(df1, X4, mean=mean, mean=mean, min=min)
     p2.5 p25 p50 mean p75 p97.5 mean mean min
[1,] 28.2  30  32   32  34  35.8   32   32  28

# > summarise_posterior(df1, X4, mean.b = mean(X4), min=min(X4))
#   p2.5 p25 p50 mean p75 p97.5 mean.b min
# 1 28.2  30  32   32  34  35.8     32  28

Runs six times faster:

> microbenchmark::microbenchmark(orig.fun=summarise_posterior(df1, X4, max(X4), min(X4)),
+                                new.fun=summarise_posterior2(df1, X4, max=max, min=min))
Unit: microseconds
     expr      min       lq      mean   median       uq      max neval
 orig.fun 4289.541 4324.490 4514.1634 4362.500 4411.225 8928.316   100
  new.fun  716.071  734.694  802.9949  755.867  778.317 4759.439   100

Data

df1 <- data.frame(matrix(1:144, 9, 16))
jay.sf
    Do you think it would be faster to put all those quantile calls into one? It seems like your approach (while a nice improvement - don't get me wrong) still does about 5x extra work due to repeated sorting, right? – jds Nov 11 '18 at 19:55

You can utilise fquantile() from the collapse package, which is a faster version of quantile() written in C.

I've written a tidyverse wrapper called q_summary to safely and efficiently calculate the quantiles.
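As a minimal sketch of the underlying primitive (assuming collapse is installed), fquantile() takes a probs vector just like quantile():

x <- rnorm(1e5)
# Drop-in replacement for quantile(); implemented in C for speed
collapse::fquantile(x, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))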

# Uncomment code below to install timeplyr
# remotes::install_github("NicChr/timeplyr")
library(dplyr)
library(tidyr)
library(timeplyr)
d <- tibble(a = sample.int(10, 5e+05, TRUE), 
            b = rnorm(5e+05))
summarise_posterior2 <- function(data, var, ...){
  qvar <- ensym(var)
  vq <- c(0.025, 0.25, 0.5, 0.75, 0.975)
  
  summarise( data, .qq = list(quantile(!!qvar, vq, names=FALSE)),
                    .nms = list(stringr::str_c("p", vq*100)), mean = mean(!!qvar), ... ) %>%
    unnest(cols = c(.qq, .nms)) %>% 
    spread( .nms, .qq )  
}

microbenchmark::microbenchmark(
  tp = {
    mu_df <- stat_summarise(d, b, stat = c("mean", "min"))
    q_df <- q_summary(d, b, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
    
    bind_cols(mu_df, q_df)
  },
  dp = {
    summarise_posterior2(d, b, mean.b = mean(b), min=min(b))
  },
  times = 10
)
#> Unit: milliseconds
#>  expr     min      lq     mean   median      uq      max neval cld
#>    tp 52.7526 53.2848 94.40282 61.28445 67.3420 405.0795    10   a
#>    dp 50.5162 74.8126 81.88487 78.68420 83.0869 138.2517    10   a

The speed up is much more noticeable when there are many groups.

d <- tibble("a"=sample(1:10^4, 5e5, TRUE), "b"=rnorm(5e5))
microbenchmark::microbenchmark(
  tp = {
    mu_df <- stat_summarise(d, b, .by = a, stat = c("mean", "min"))
    q_df <- q_summary(d, b, .by = a, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
    
    bind_cols(mu_df, fselect(q_df, .cols = -1L))
  },
  dp = {
    summarise_posterior2(group_by(d, a), b, mean.b = mean(b), min=min(b))
  },
  times = 5
)
#> Unit: milliseconds
#>  expr       min        lq      mean    median        uq       max neval cld
#>    tp  215.7286  219.9324  285.5098  248.7057  317.6228  425.5593     5  a 
#>    dp 2601.6504 2773.1440 3306.6960 3441.4686 3816.5469 3900.6699     5   b

Created on 2023-06-07 with reprex v2.0.2

NicChr