Two very fast collapse options are GRPN and fcount. fcount is a fast version of dplyr::count and uses the same syntax. You can use add = TRUE to add the count as a column (mutate-like):
library(collapse)
fcount(df1, Year, Month) #or df1 %>% fcount(Year, Month)
#   Year Month N
# 1 2012   Feb 4
# 2 2014   Jan 3
# 3 2013   Mar 2
# 4 2013   Feb 2
# 5 2012   Jan 2
# 6 2012   Mar 2
# 7 2013   Jan 1
# 8 2014   Feb 3
# 9 2014   Mar 1
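The call above returns one row per group. As a minimal sketch of the add = TRUE variant, the snippet below uses a small, hypothetical stand-in for df1 (the original data is not shown in this answer) and keeps every row while appending the group size as a column named N:

```r
library(collapse)

# Hypothetical stand-in for df1; the real data is not shown in the answer
df1 <- data.frame(
  Year  = c(2012, 2012, 2012, 2013, 2013, 2014),
  Month = c("Jan", "Jan", "Feb", "Mar", "Mar", "Jan")
)

# add = TRUE keeps all rows and appends the per-group count as column N
out <- fcount(df1, Year, Month, add = TRUE)
out
```

The result has the same number of rows as df1, which is what a grouped mutate would give in dplyr.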
GRPN is closer to collapse's original syntax. First, group the data with GRP, then call GRPN on the resulting grouping object. By default, GRPN creates an expanded vector that matches the original data, with one group size per row (in dplyr, this would be equivalent to using mutate). Use expand = FALSE to output the summarized vector, with one group size per group.
library(collapse)
GRPN(GRP(df1, .c(Year, Month)), expand = FALSE)
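To make the difference between the two modes concrete, here is a small sketch. It again assumes a hypothetical six-row df1, since the original data is not shown:

```r
library(collapse)

# Hypothetical stand-in for df1; the real data is not shown in the answer
df1 <- data.frame(
  Year  = c(2012, 2012, 2012, 2013, 2013, 2014),
  Month = c("Jan", "Jan", "Feb", "Mar", "Mar", "Jan")
)

# Build the grouping object once and reuse it
g <- GRP(df1, .c(Year, Month))

# Default (expand = TRUE): one group size per row of df1
n_per_row <- GRPN(g)

# expand = FALSE: one group size per distinct (Year, Month) group
n_per_group <- GRPN(g, expand = FALSE)

# The expanded vector can be attached directly as a new column
df1$N <- n_per_row
df1
```

Here n_per_row has six elements and n_per_group has four, one for each distinct Year and Month combination.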
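Microbenchmark with a 100,000 x 3 data frame and 4997 different groups. collapse::fcount is much faster than any other option: by median time it is roughly 2 to 3 times faster than the other collapse approaches, about 10 times faster than data.table, about 40 times faster than dplyr::count, and over 100 times faster than aggregate.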
library(collapse)
library(dplyr)
library(data.table)
library(microbenchmark)
set.seed(1)
df <- data.frame(x = gl(1000, 100),
                 y = rbinom(100000, 4, .5),
                 z = runif(100000))
dt <- df
mb <-
  microbenchmark(
    aggregate = aggregate(z ~ x + y, data = df, FUN = length),
    count = count(df, x, y),
    data.table = setDT(dt)[, .N, by = .(x, y)],
    'collapse::fnobs' = df %>% fgroup_by(x, y) %>% fsummarise(number = fnobs(z)),
    'collapse::GRPN' = GRPN(GRP(df, .c(x, y)), expand = FALSE),
    'collapse::fcount' = fcount(df, x, y)
  )
# Unit: milliseconds
#              expr      min        lq       mean    median        uq      max neval
#         aggregate 159.5459 203.87385 227.787186 223.93050 246.36025 335.0302   100
#             count  55.1765  63.83560  74.715889  73.60195  79.20170 196.8888   100
#        data.table   8.4483  15.57120  18.308277  18.10790  20.65460  31.2666   100
#   collapse::fnobs   3.3325   4.16145   5.695979   5.18225   6.27720  22.7697   100
#    collapse::GRPN   3.0254   3.80890   4.844727   4.59445   5.50995  13.6649   100
#  collapse::fcount   1.2222   1.57395   3.087526   1.89540   2.47955  22.5756   100
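Before trusting timings like these, it can help to confirm that the benchmarked expressions agree. The sketch below is not part of the original benchmark; it rebuilds the same df and compares the group sizes from three of the approaches. Row order differs between methods, so it compares sorted counts, which is a sanity check rather than a row-by-row proof:

```r
library(collapse)

set.seed(1)
df <- data.frame(x = gl(1000, 100),
                 y = rbinom(100000, 4, .5),
                 z = runif(100000))

# Group sizes from three of the benchmarked approaches
n_fcount <- fcount(df, x, y)$N
n_grpn   <- GRPN(GRP(df, .c(x, y)), expand = FALSE)
n_base   <- aggregate(z ~ x + y, data = df, FUN = length)$z

# Compare as sorted integer vectors because group order may differ
same_counts <- identical(sort(as.integer(n_fcount)), sort(as.integer(n_grpn))) &&
  identical(sort(as.integer(n_fcount)), sort(as.integer(n_base)))
same_counts
```

All three vectors should also sum to nrow(df), since every row belongs to exactly one group.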