I want to calculate the quantiles of each row of a data frame and return the result as a matrix. Since I want to calculate an arbitrary number of quantiles (and I imagine it is faster to calculate them all at once than to re-run the function for each one), I tried using a formula I found in this question:

library(dplyr)
df <- as.data.frame(matrix(rbinom(1000, 10, 0.5), nrow = 2))

interim_res <- df %>% 
              rowwise() %>% 
              do(out = sapply(min(df):max(df), function(i) sum(i==.)))

interim_res <- interim_res[[1]] %>% do.call(rbind,.) %>% as.data.frame(.)

This makes sense, but when I try to apply the same framework to the quantile() function, as coded here,

interim_res <- df %>% 
              rowwise() %>% 
              do(out = quantile(.,probs = c(0.1,0.5,0.9)))

interim_res <- interim_res[[1]] %>% do.call(rbind,.) %>% as.data.frame(.)

I get this error message:

Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :

'x' must be atomic

Why am I getting an error with quantile and not sum? How should I fix this issue?

Max Candocia
  • In your example, the quantiles are computed across the variables in the data.frame and not across the observations. This is fairly unusual. Are you sure this is what you wanted? – akhmed May 13 '15 at 21:48
  • The columns are results of a simulation and the rows are different parameter configurations. – Max Candocia May 13 '15 at 21:50

1 Answer

Inside do(), the . placeholder is a data frame rather than an atomic vector, which is why quantile() throws the error. Unlisting it first works:

df %>% 
  rowwise() %>% 
  do(data.frame(as.list(quantile(unlist(.),probs = c(0.1,0.5,0.9)))))

but risks being horrendously slow. Why not just:

apply(df, 1, quantile, probs = c(0.1,0.5,0.9))
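One detail worth checking with the apply() call above is the orientation of the result: apply() returns one column per input row, so a transpose is needed to get one row of quantiles per row of the data. The following is a minimal sketch, assuming the same toy data as in the question; the seed and the names probs, q_by_col and q_by_row are illustrative and not part of the original post.

```r
# Illustrative data: 2 parameter configurations (rows) x 500 simulation runs (columns)
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible
df <- as.data.frame(matrix(rbinom(1000, 10, 0.5), nrow = 2))
probs <- c(0.1, 0.5, 0.9)  # any number of quantiles can be requested at once

# apply() over rows returns one COLUMN per input row: a 3 x 2 matrix here
q_by_col <- apply(df, 1, quantile, probs = probs)

# Transpose to get one ROW per input row: a 2 x 3 matrix
q_by_row <- t(q_by_col)

dim(q_by_row)       # 2 3
colnames(q_by_row)  # "10%" "50%" "90%"
```

The column names come from quantile() itself, so they follow whatever is passed in probs.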

Here are some timings with larger data:

df <- as.data.frame(matrix(rbinom(100000,10,0.5),nrow = 1000))

library(microbenchmark)
microbenchmark(
  dplyr = df %>% rowwise() %>% do(data.frame(as.list(quantile(unlist(.), probs = c(0.1, 0.5, 0.9))))),
  apply = apply(df, 1, quantile, probs = c(0.1, 0.5, 0.9)),
  times = 5
)

Produces:

  expr       min        lq      mean    median        uq       max neval
 dplyr 2375.2319 2376.6658 2446.4070 2419.4561 2454.6017 2606.0794     5
 apply  224.7869  231.7193  246.7137  233.4757  245.0718  298.5144     5

If you go the apply route, you should probably stick with a matrix from the get-go, since apply() converts a data frame to a matrix on every call anyway.
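To make the matrix suggestion concrete, here is a small sketch under the same assumptions as the question's toy data. It also shows why the original do() call failed: a row taken from a data frame is still a list, while a row taken from a matrix is already an atomic vector that quantile() can sort. The seed and the names m, q_m and q_df are illustrative only.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible
m  <- matrix(rbinom(1000, 10, 0.5), nrow = 2)  # keep the simulation output as a matrix
df <- as.data.frame(m)                         # data frame version, for comparison

is.atomic(df[1, ])  # FALSE: a one-row data frame is a list, hence "'x' must be atomic"
is.atomic(m[1, ])   # TRUE: a matrix row is a plain integer vector

# Row-wise quantiles straight from the matrix, one row of results per input row
q_m  <- t(apply(m,  1, quantile, probs = c(0.1, 0.5, 0.9)))

# Same computation starting from the data frame; apply() coerces it to a matrix first
q_df <- t(apply(df, 1, quantile, probs = c(0.1, 0.5, 0.9)))

all.equal(unname(q_m), unname(q_df))  # TRUE: identical values either way
```

Both routes give the same numbers; working from the matrix simply skips the coercion step.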

BrodieG