We can apply generic speed-up strategies:
- Do less
- Choose an appropriate back-end
- Use appropriate data structures
dplyr provides syntactic sugar for data manipulation, but it may not be the most efficient choice for handling large data sets.
Solution 1
We could rewrite the code slightly to be more efficient using the collapse package, which provides fast C/C++-based implementations of dplyr-style verbs. Its functions are the dplyr names prefixed with f, with one exception: fsubset, which is similar to dplyr::filter (or base R subset).
library(collapse)

df |>
  fgroup_by(Month, ID) |>
  fsummarise(Qty = fsum(Qty),
             Sales = fsum(Sales),
             Leads = fsum(Leads),
             Region = fsubset(Region, 1L),
             keep.group_vars = T) |>
  as_tibble() # optional
#> # A tibble: 4 x 6
#>   Month    ID   Qty Sales Leads Region
#>   <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 April    11   230  2100    22 East
#> 2 June     11   260  2450    15 North
#> 3 May      10   110  1000     8 East
#> 4 May      12   110   900     9 North
Here |> (the native pipe, available from R 4.1.0) is slightly faster than %>%. The result is an ungrouped data frame.
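As a quick aside on the fsubset/filter parallel mentioned above, here is a minimal sketch on the example df (not part of the solution itself):

# fsubset() on a data frame behaves like dplyr::filter() / base subset()
fsubset(df, Month == "May")
# roughly equivalent to dplyr::filter(df, Month == "May")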
Solution 2
data.table is often lauded for its speed, memory efficiency and utility. The easiest way to convert existing dplyr code to data.table is the dtplyr package, which ships with the tidyverse. We can convert the code by adding two lines.
library(dtplyr)

df1 <- lazy_dt(df)
df1 %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
            Region = first(Region)) %>%
  as_tibble() # or as.data.table()
Note that the result is an ungrouped data frame at the end.
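If you are curious about the data.table code dtplyr generates, you can call show_query() on the lazy pipeline before collecting it, for example:

df1 %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
            Region = first(Region)) %>%
  show_query() # prints the translated data.table expression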
Benchmarks
The approaches are wrapped in functions; dplyr here is www's approach. All approaches output a tibble.
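For reference, the wrappers look roughly like this (the collapse and dtplyr bodies are the solutions above; the dplyr body reconstructs www's group_by()/summarize() approach, so the exact code may differ slightly):

collapse <- \(x) x |>
  fgroup_by(Month, ID) |>
  fsummarise(Qty = fsum(Qty), Sales = fsum(Sales), Leads = fsum(Leads),
             Region = fsubset(Region, 1L), keep.group_vars = T) |>
  as_tibble()

dplyr <- \(x) x %>% # reconstruction of www's approach
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
            Region = first(Region), .groups = "drop")

dtplyr <- \(x) lazy_dt(x) %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
            Region = first(Region)) %>%
  as_tibble()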
bench::mark(collapse = collapse(df), dplyr = dplyr(df), dtplyr = dtplyr(df),
            time_unit = "ms", iterations = 200)[c(1, 3, 5, 7)]
# A tibble: 3 x 4
  expression median mem_alloc n_itr
  <bch:expr>  <dbl> <bch:byt> <int>
1 collapse    0.316        0B   200
2 dplyr       5.42      8.73KB   195
3 dtplyr      6.67    120.21KB   196
We can see that collapse is more memory efficient and significantly faster than dplyr. The dtplyr approach is included because its timings scale differently from dplyr's as data grows, and because converting existing code to it is so convenient.
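One way to check how the approaches scale is to blow the example data up and re-run the benchmark; a rough sketch (the replication factor is arbitrary):

big <- dplyr::bind_rows(replicate(10000, df, simplify = FALSE))
bench::mark(collapse = collapse(big), dplyr = dplyr(big), dtplyr = dtplyr(big),
            time_unit = "ms", iterations = 20)[c(1, 3, 5, 7)]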
Per @www's request, a pure data.table approach is included as well, with the wrapper functions rewritten for brevity. Input/output is a data.frame for collapse and a data.table for data.table, respectively.
data.table = \(x){setDT(x); cols = c("Qty", "Sales", "Leads"); x[, c(lapply(.SD, sum, na.rm = T), Region = first(Region)), .SDcols = cols, by = .(Month, ID)][]}
# retaining the `|>` pipes for readability, impact is ~4us.
collapse = \(x) x |> fgroup_by(Month, ID) |> fsummarise(Qty = fsum(Qty), Sales = fsum(Sales), Leads = fsum(Leads), Region = fsubset(Region, 1L), keep.group_vars = T)
dt <- as.data.table(df)
bench::mark(collapse(df), iterations = 10e3)[c(1, 3, 5, 7)]
bench::mark(data.table(dt), iterations = 10e3)[c(1, 3, 5, 7)]
  expression     median mem_alloc n_itr
  <bch:expr>   <bch:tm> <bch:byt> <int>
1 collapse(df)    150us        0B  9988
2 data.table(dt)  796us     146KB  9939
For such a small data set, the difference between collapse and pure data.table is negligible. The speed increase likely comes from using fsum instead of base R sum.
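To verify the fsum point in isolation, a quick micro-benchmark on a plain numeric vector is enough (a minimal sketch; the vector length is arbitrary):

x <- rnorm(1e5)
# check = FALSE because the two sums can differ in the last few bits
# due to accumulation order
bench::mark(base = sum(x), collapse = fsum(x), check = FALSE)[c(1, 3, 5, 7)]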