The OP doesn't include a way to scale up the benchmark, so I'm making my own:
library(data.table)
library(dplyr)
n = 3e6
n_days = 20000
set.seed(10)
dat <- data_frame(
date = sample(
seq(as.Date('1999/01/01'), as.Date('1999/01/01') + n_days - 1, by="day")
, n, TRUE),
cash = sample(1010:1030, n, TRUE),
stage = sample(LETTERS[1:7], n, TRUE)
) %>% distinct()
DT = data.table(dat)[, date := as.IDate(date)]
Testing:
# OP's approach
system.time(
res <- dat %>%
mutate(
year = format(date, '%Y'),
month = format(date, '%B')
) %>%
arrange(desc(date)) %>%
group_by(cash, stage, year, month) %>%
slice(1)
)
# user system elapsed
# 9.44 0.09 9.54
# a data.table way
system.time({
DTres <- DT[, g := date - mday(date) + 1L ][order(-date), .SD[1L], by=.(cash, stage, g)]
})
# user system elapsed
# 0.51 0.00 0.52
# verify
fsetequal(
data.table(res[, c("cash","stage","date")])[, date := as.IDate(date)][],
DTres[, c("cash","stage","date")]
) # TRUE
Translating back to dplyr:
system.time({
newres <- dat %>% mutate(g = date - as.POSIXlt(date)$mday + 1) %>%
arrange(desc(date)) %>% group_by(cash, stage, g) %>% slice(1L)
})
# Error, who knows why?
system.time({
newres <- dat %>% mutate(g = date + 1 - date %>% as.POSIXlt %>% `[[`("mday")) %>%
arrange(desc(date)) %>% group_by(cash, stage, g) %>% slice(1L)
})
# user system elapsed
# 1.47 0.04 1.52
fsetequal(
data.table(res[c("date","cash","stage")]),
data.table(newres[c("date","cash","stage")])
) # TRUE
# or ...
iddat <- dat %>% mutate(date = data.table::as.IDate(date))
mday <- data.table::mday
system.time({
borrowres <- iddat %>% arrange(desc(date)) %>%
distinct(cash, stage, g = date - mday(date) + 1L)
})
# user system elapsed
# 0.92 0.02 0.94
fsetequal(
data.table(borrowres[names(DTres)]),
DTres
) # TRUE
The relative timings didn't change much as I tweaked n
and n_days
. Thanks to @Arun for this way of rounding. Previously, I had round(date, "months")
. It seems that the key is using arithmetic instead of format
. I'm not sure about the remaining difference in the timings; maybe it can be addressed by using dtplyr. The switch to arrange %>% distinct
didn't do much beyond cleaning up the syntax.
Side note: I'm loading dplyr instead of tidyverse because I don't really know what the latter contains. I tried it with tidyverse and got the same timings, though.