I am working with large data frames (a minimum of 8 million rows) and want to do some basic calculations based on a couple of grouping variables and rmultinom. As my code stands it takes at least ~1 second to complete the calculation, which wouldn't be a problem except that I need to do it thousands of times, so I would really like to speed it up.
I am currently using data frames and the tidyverse, but I am not tied to either. I have tried to implement this with data.table but couldn't figure it out (my rough attempt is at the bottom of the post). Any suggestions on how to speed things up would be much appreciated.
An example (the real data can be an order of magnitude larger or more):
library(tidyverse)
library(microbenchmark)
# create dummy data
df <- data.frame(fact = rep(letters, each = 312000),
                 month = rep(month.name, 26),
                 num = rep(sample(10000:100000, 12), 26),
                 prob = runif(312))
# Order by month
df <- df[order(df$month), ]
# group by two factor variables and calculate new variable
microbenchmark({
  df2 <- df %>%
    group_by(fact, month) %>%
    mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0))
}, times = 10)
> Unit: milliseconds
>       min       lq     mean   median       uq      max neval
>  816.3126 822.4083 840.7966 834.6163 855.5139 879.9345    10
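For reference, the direction I was going with data.table was roughly the sketch below (converting with as.data.table and doing a grouped := assignment). I'm not sure this is correct or idiomatic, which is partly where I got stuck, so please treat it as a rough translation rather than a working benchmark:

library(data.table)

dt <- as.data.table(df)

# same logic as the dplyr mutate: one multinomial draw per (fact, month) group,
# using num[1] as the group's size and prob as the probability vector
dt[, res := ifelse(prob > 0,
                   c(rmultinom(1, num[1], prob = prob)),
                   0),
   by = .(fact, month)]

This is meant to mirror the dplyr version above; I don't know whether it would actually be any faster, hence the question.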