
I am working with large dataframes (8 million rows at minimum) and want to do some basic calculations based on a couple of grouping variables and rmultinom. As my code stands, the calculation takes at least ~1 second to complete, which wouldn't be a problem except that I need to do it thousands of times, so I would really like to speed it up.

I am currently using dataframes and the tidyverse, but I am not tied to either. I have tried to implement this with data.table but couldn't figure it out. Any suggestions on how I could speed things up would be much appreciated.

An example (the real data can be an order of magnitude larger or more):

library(tidyverse)
library(microbenchmark)

# create dummy data
df <- data.frame(fact = rep(letters, each = 312000), 
                 month = rep(month.name, 26), 
                 num = rep(sample(10000:100000, 12), 26), 
                 prob = runif(312))

# Order by month     
df <- df[order(df$month), ]

# group by two factor variables and calculate new variable 
microbenchmark({
  df2 <- df %>%
    group_by(fact, month) %>%
    mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0))}, times = 10)


Unit: milliseconds
      min       lq     mean   median       uq      max neval
 816.3126 822.4083 840.7966 834.6163 855.5139 879.9345    10
flee

3 Answers


This is a bit too long for a comment, so I'm posting it here.

Running

library(profr)
plot(profr(
  df %>%
    group_by(fact, month) %>%
    mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0))
))

I get the following:

[profiling plot: almost all of the runtime is spent inside rmultinom]
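The same picture comes out of base R's profiler, in case profr isn't installed (a quick sketch):

# a sketch using base R's sampling profiler instead of profr
Rprof(tmp <- tempfile())
df %>%
  group_by(fact, month) %>%
  mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0))
Rprof(NULL)
head(summaryRprof(tmp)$by.self)  # expect rmultinom to dominate self-time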

So it looks like you really want to find a faster implementation of rmultinom, which is the bottleneck. This bottleneck is the same for both dplyr and data.table, which means that only speeding up rmultinom itself will give you a substantial speed improvement.
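If you do pursue that, one common trick is to decompose the multinomial into a sequence of binomial draws, which lets rbinom vectorise the sampling across all groups at once instead of calling rmultinom once per group. This mainly pays off when there are many groups with relatively few categories each. A minimal sketch (the helper name rmultinom_vec is made up for illustration, and it expects one row of probabilities per group):

# Sketch: vectorised multinomial sampling via sequential binomial draws.
# size: vector of trial counts, one per group
# prob: matrix of probabilities, one row per group, one column per category
rmultinom_vec <- function(size, prob) {
  n <- nrow(prob)
  k <- ncol(prob)
  prob <- prob / rowSums(prob)   # normalise each row
  out <- matrix(0L, nrow = n, ncol = k)
  remaining <- size              # trials not yet allocated
  p_left <- rep(1, n)            # probability mass not yet allocated
  for (j in seq_len(k - 1)) {
    # conditional on the earlier cells, cell j is binomial
    out[, j] <- rbinom(n, remaining, pmin(prob[, j] / p_left, 1))
    remaining <- remaining - out[, j]
    p_left <- p_left - prob[, j]
  }
  out[, k] <- remaining          # the last cell takes whatever is left
  out
}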

coffeinjunky
  • OK thanks, after further investigation another user had the same problem with `rmultinom` [here](https://stackoverflow.com/questions/23097269/efficient-multinomial-sampling-when-sample-size-and-probability-vary), they implemented a solution using `Rcpp` which I haven't used but will look into. – flee Oct 08 '18 at 22:35

You might as well reduce the overhead caused by the pipe operator, in both dplyr and data.table syntax.

To illustrate the overhead caused by pipes:

microbenchmark(pipe = iris %>%
                 group_by(Species) %>% 
                 mutate(mean = mean(Sepal.Length)),
               no_pipe = mutate(group_by(iris, Species), mean = mean(Sepal.Length)),
               times = 100) %>% autoplot()

[benchmark plot: the piped call shows extra overhead compared with the nested call]
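Applied to the question's data, the nested (pipe-free) version of the same call is:

# the question's pipeline without the pipe overhead
df2 <- mutate(
  group_by(df, fact, month),
  res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0)
)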

Rich Pauloo

Using data.table, you could do:

dt <- copy(df)
setDT(dt)

dt[, res := 0L][prob > 0, res := c(rmultinom(1, num[1], prob = prob)), by = .(fact, month)]

Which gives you a minor speed improvement:

microbenchmark(dp = df %>%
                 group_by(fact, month) %>%
                 mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0)),
               dt = dt[, res := 0L][prob > 0, res := c(rmultinom(1, num[1], prob = prob)), by = .(fact, month)],
               times = 1)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
   dp 1.356745 1.356745 1.356745 1.356745 1.356745 1.356745     1
   dt 1.063363 1.063363 1.063363 1.063363 1.063363 1.063363     1
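Note that with times = 1 each expression is timed only once, so a gap of this size can easily be run-to-run noise; repeating the comparison gives a steadier picture:

# a sketch: more iterations average out run-to-run noise
microbenchmark(
  dp = df %>%
    group_by(fact, month) %>%
    mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0)),
  dt = dt[, res := 0L][prob > 0,
                       res := c(rmultinom(1, num[1], prob = prob)),
                       by = .(fact, month)],
  times = 10
)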
Jaap
  • @Jaap Thank you, it all adds up! Do you think I would be able to make further substantial improvements or is this likely about as good as I'll get? Need to decide if I keep looking for improvements or just move on. Cheers. – flee Oct 08 '18 at 20:44
  • Hmm, I just ran your code on my machine and I didn't get any performance gain at all. – flee Oct 08 '18 at 20:51