
I have a data frame that looks like this:

df <- data.frame(
  Month  = c("April", "April", "May", "May", "June", "June", "June"),
  ID     = c(11, 11, 12, 10, 11, 11, 11),
  Region = c("East", "West", "North", "East", "North", "East", "West"),
  Qty    = c(120, 110, 110, 110, 100, 90, 70),
  Sales  = c(1000, 1100, 900, 1000, 1000, 800, 650),
  Leads  = c(10, 12, 9, 8, 6, 5, 4)
)

Month   ID     Region    Qty    Sales   Leads
April   11     East      120    1000    10
April   11     West      110    1100    12
May     12     North     110    900     9
May     10     East      110    1000    8
June    11     North     100    1000    6
June    11     East      90     800     5
June    11     West      70     650     4

I want a data frame that looks like this:

Month   ID     Qty     Sales   Leads   Region
April   11     230     2100    22      East
May     12     110     900     9       North
May     10     110     1000    8       East
June    11     260     2450    15      North

I am using the following code:

result <- df %>%
  group_by(Month, ID) %>%
  mutate(across(.cols = Qty:Leads, ~sum(.x, na.rm = T))) %>%
  slice(n = 1)

result$Region <- NULL

I have over 2 million such rows and it is taking forever to calculate the aggregate.

I am using mutate and slice instead of summarize because the df is arranged in a certain way and I want to retain the Region in that first row.

However, I think there could be a more efficient way. Please help on both counts. I can't figure it out for the life of me.
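
For reference, a synthetic data set of roughly that size can be used to reproduce the slowdown and to time candidate solutions. This is a sketch; big_df and the sampled values are arbitrary placeholders:

set.seed(1)
n <- 2e6
big_df <- data.frame(
  Month  = sample(month.name[4:6], n, replace = TRUE),
  ID     = sample(10:12, n, replace = TRUE),
  Region = sample(c("East", "West", "North"), n, replace = TRUE),
  Qty    = sample(50:150, n, replace = TRUE),
  Sales  = sample(500:1500, n, replace = TRUE),
  Leads  = sample(1:15, n, replace = TRUE)
)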

Donald Seinen
FinRC
  • Your code cannot create your expected output. I can see you want to retain a certain order, but I don't understand why you cannot do it by arranging the data frame after summarize. – www Nov 14 '21 at 05:26
  • You said you used this code `df %>% group_by(month, ID) %>% mutate(across(.cols = Qty:Leads, ~sum(.x, na.rm = T))) %>% slice(n = 1)` (notice that `month` should be `Month`) to generate the expected outcome. But the result from this is the same as using `summarize`. The order is changed in both approaches, whether you use `mutate` or `summarize`. So I don't understand why you cannot use `summarize`. You have to arrange the data frame anyway. – www Nov 14 '21 at 05:37
  • Furthermore, you delete the `Region` column deliberately in your code with `result$Region <- NULL`, but your expected outcome contains that column. I just don't understand what you did and what your expectations are. – www Nov 14 '21 at 05:38

2 Answers


summarize makes more sense to me than mutate and slice. This should save you some time.

library(dplyr)
result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
            Region = first(Region))
result
# # A tibble: 4 x 6
# # Groups:   Month [3]
#   Month    ID   Qty Sales Leads Region
#   <chr> <dbl> <dbl> <dbl> <dbl> <chr> 
# 1 April    11   230  2100    22 East  
# 2 June     11   260  2450    15 North 
# 3 May      10   110  1000     8 East  
# 4 May      12   110   900     9 North 

Here is a data.table solution.

library(data.table)

setDT(df)

cols <- c("Qty", "Sales", "Leads")

df[, c(lapply(.SD, sum, na.rm = TRUE),
       Region = first(Region)), .SDcols = cols, 
   by = .(Month, ID)][]
#    Month ID Qty Sales Leads Region
# 1: April 11 230  2100    22   East
# 2:   May 12 110   900     9  North
# 3:   May 10 110  1000     8   East
# 4:  June 11 260  2450    15  North
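
If the original arrangement of the rows matters (as discussed in the comments), the summarized result can be re-ordered afterwards. A minimal sketch, assuming the months should follow calendar order; the exact sort keys are an assumption and should be adapted to your data:

result %>%
  ungroup() %>%                          # drop the grouping left by summarize()
  arrange(match(Month, month.name))      # month.name is base R's built-in calendar order
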
www
  • My bad. I edited my question. I have given a reason why I used mutate and slice and I will have to retain the 'Region' column – FinRC Nov 14 '21 at 05:21
  • @FinRC I have updated my answer to keep the `Region` column. I still think `summarize` is the right way. If you need to maintain a certain order, arrange the data frame after summarize. – www Nov 14 '21 at 05:28
  • Thank you. Looks good. Is there a data.table solution? As a data.frame it is taking very long again. – FinRC Nov 14 '21 at 05:52
  • Beware that the result is a grouped tibble - you might want to add `ungroup()` at the end – Donald Seinen Nov 14 '21 at 06:01
  • @FinRC I have updated my answer with a `data.table` solution. – www Nov 15 '21 at 02:48

We can apply generic speed-up strategies:

  1. Do less
  2. Choose an appropriate back-end
  3. Use appropriate data structures

dplyr provides syntactic sugar for data manipulation, but may not be the most efficient when it comes to handling large data sets.

Solution 1

We can rewrite the code to be more efficient using the collapse package, which provides fast C/C++-based equivalents of dplyr's verbs. It prefixes the dplyr function names with f, with one exception: fsubset, which is similar to dplyr::filter (or base R subset).

library(collapse)
df |>
    fgroup_by(Month, ID) |>
    fsummarise(Qty = fsum(Qty),
               Sales = fsum(Sales),
               Leads = fsum(Leads),
               Region = fsubset(Region, 1L),
               keep.group_vars = T) |>
    as_tibble() # optional; as_tibble() comes from the tibble package
#> # A tibble: 4 x 6
#>   Month    ID   Qty Sales Leads Region
#>   <chr> <dbl> <dbl> <dbl> <dbl> <chr> 
#> 1 April    11   230  2100    22 East  
#> 2 June     11   260  2450    15 North 
#> 3 May      10   110  1000     8 East  
#> 4 May      12   110   900     9 North 

Here |> (which requires R >= 4.1.0) is the native pipe, which is slightly faster than %>%. The result of this pipeline is ungrouped.

Solution 2

data.table is often lauded for its speed, memory use and utility. The easiest way to convert existing dplyr code to data.table is the dtplyr package, which is installed as part of the tidyverse. We can convert the code by adding two lines.

library(dtplyr)
df1 <- lazy_dt(df)
df1 %>%
      group_by(Month, ID) %>%
      summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
                Region = first(Region)) %>%
      as_tibble() # or as.data.table()

Note that the result is an ungrouped tibble (or an ungrouped data.table, depending on the final conversion).
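
To see the data.table code that dtplyr generates, the translated query can be inspected before collecting the result; this is a small illustrative addition, not part of the original answer:

df1 %>%
      group_by(Month, ID) %>%
      summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
                Region = first(Region)) %>%
      show_query() # prints the generated data.table expression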

Benchmarks

The approaches are wrapped in functions; dplyr here is www's approach. All approaches output a tibble.
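
The wrapper functions themselves are not shown above; a plausible reconstruction, assuming each one simply wraps the corresponding pipeline from earlier in the thread and returns an ungrouped tibble, could look like this:

collapse <- function(x) {
  x |>
    fgroup_by(Month, ID) |>
    fsummarise(Qty = fsum(Qty), Sales = fsum(Sales), Leads = fsum(Leads),
               Region = fsubset(Region, 1L), keep.group_vars = TRUE) |>
    as_tibble()
}

dplyr <- function(x) {
  x %>%
    group_by(Month, ID) %>%
    summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = TRUE)),
              Region = first(Region)) %>%
    ungroup()
}

dtplyr <- function(x) {
  lazy_dt(x) %>%
    group_by(Month, ID) %>%
    summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = TRUE)),
              Region = first(Region)) %>%
    as_tibble()
}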

bench::mark(collapse = collapse(df), dplyr = dplyr(df), dtplyr = dtplyr(df),
            time_unit = "ms", iterations = 200)[c(1, 3,5,7)]
# A tibble: 3 x 4
  expression median mem_alloc n_itr
  <bch:expr>  <dbl> <bch:byt> <int>
1 collapse    0.316        0B   200
2 dplyr       5.42     8.73KB   195
3 dtplyr      6.67   120.21KB   196

We can see that collapse is more memory-efficient and significantly faster than dplyr. The dtplyr approach is included because it scales differently from dplyr and because it is convenient for rewriting existing dplyr code.

Per @www's request, the pure data.table approach is included below, with the wrapper functions rewritten for brevity. Input/output is a data.frame for collapse and a data.table for data.table, respectively.

data.table = \(x){setDT(x); cols = c("Qty", "Sales", "Leads");x[, c(lapply(.SD, sum, na.rm = T), Region = first(Region)), .SDcols = cols, by = .(Month, ID)][]}
# retaining the `|>` pipes for readability; impact is ~4us.
collapse = \(x) x|>fgroup_by(Month, ID)|>fsummarise(Qty = fsum(Qty),Sales = fsum(Sales),Leads = fsum(Leads),Region = fsubset(Region, 1L),keep.group_vars = T)
dt <- as.data.table(df)
bench::mark(collapse(df), iterations = 10e3)[c(1,3,5,7)] ; bench::mark(data.table(dt), iterations = 10e3)[c(1,3,5,7)]
  expression     median mem_alloc n_itr
  <bch:expr>   <bch:tm> <bch:byt> <int>
1 collapse(df)    150us        0B  9988
2 data.table(dt)  796us     146KB  9939

The difference between collapse and pure data.table, for such a small dataset, is negligible. The reason for the speed increase is likely the use of fsum instead of base R's sum.
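
That last claim can be spot-checked in isolation. An illustrative sketch, not part of the original answer; timings will vary by machine:

x <- rnorm(1e6)
bench::mark(base = sum(x), collapse = fsum(x),
            check = FALSE, # results can differ by tiny floating point amounts
            iterations = 1000)[c(1, 3, 5, 7)]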

Donald Seinen
  • Thank you for providing this nice answer with a comparison of several methods. If you don't mind and if you have time, could you add my `data.table` solution to your benchmark comparisons? Just curious if a pure `data.table` solution could be more efficient. – www Nov 15 '21 at 02:50
  • @www pure `data.table` is likely a very efficient solution for the OP; in my experience `collapse` and `data.table` are quite close in performance. But it's tough to make a fair comparison, as `setDT` overwrites the `data.frame`, so it doesn't incur the same data-structure conversion cost on every iteration of the benchmark, and it also has slightly different output. Further, small sample data won't give the whole picture (I reckon DT is faster for 2M rows). What do you reckon is fair: to include `as_tibble`, or to give all approaches their native I/O formats? – Donald Seinen Nov 15 '21 at 04:15