Group efficiently with data.table in R

Question

It would be possible to abbreviate the following script:

not to use so many chained operations.
Avoid using .SD and by as much as possible

library(data.table)
DT<-structure(list(title = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d", "d"), date = c("12-07-2020", "13-07-2020", "14-07-2020", "15-07-2020", "12-07-2020", "13-07-2020", 
         "14-07-2020", "15-07-2020", "12-07-2020", "13-07-2020", "14-07-2020", "15-07-2020", 
         "12-07-2020", "13-07-2020", "14-07-2020", "15-07-2020"), 
bucket = c(1, 1, 1, 4, 9, 7, 10, 10, 8, 5, 5, 5, 8, 10, 9, 10), 
score = c(86, 22, 24, 54, 66, 76, 43, 97, 9, 53, 45, 40, 21, 99, 91, 90)),
 row.names = c(NA, -16L), class = c("data.table","data.frame"))
 
DT[DT[, .I[bucket == min(bucket)], by = title]$V1]
DT[, .SD[which(bucket == min(bucket))], by =title][,
  `:=`(avg_score = mean(score)), by = .(title)][,
    .SD[.N,c(1,2,4)], by = .(title)]

the original code is a scrip using dplyr .:RStudio Community

tt <- data %>% 
group_by(title) %>% 
filter(bucket == min(bucket)) %>% 
mutate(avg_score = mean(score)) %>% 
slice_max(date) %>% 
select(-score)
>
title date       bucket avg_score
  <chr> <chr>       <dbl>     <dbl>
1 a     14-07-2020      1        44
2 b     13-07-2020      7        76
3 c     15-07-2020      5        46
4 d     12-07-2020      8        21
>

Hi @HerClau! Thanks for posting a small example and the code you have tried. Can you please also describe _in words_ what you try to achieve. This will make it easier for people to help you and also make your post more valuable to future visitors. Cheers. — Henrik, Jul 17 '20 at 09:15
Could I understand why you want to avoid using `.SD` and `by` "as much as possible"? — MichaelChirico, Jul 17 '20 at 13:42
Agree with @MichaelChirico. You also need to clarify what you mean by _efficiently_ in "Group efficiently". I _think_ the way you have selected relevant rows per group in `i` (`DT[, .I[bucket == min(bucket)], by = title]$V1`) is a canonical method; see e.g. [How to select the row with the maximum value in each group](https://stackoverflow.com/a/24558696/1851712), [Subset by group with data.table](https://stackoverflow.com/a/16574176/1851712). — Henrik, Jul 17 '20 at 14:13
Could you comment on the need for $ V1 in the canonical method: DT [, .I [bucket == min (bucket)], by = title] $ V1) and why DT [bucket == min (bucket)], .SD, by = title]; it only returns the first group. For me it was a surprise. Please understand that I have been using data.table and R in particular for a short time. I would like to understand or understand these two doubts — HerClau, Jul 17 '20 at 15:51
`$V1` selects the column `V1` from a `data.table` and returns a vector. If you do a calculation `by` group and don't name the result, `V1` is the default name. Play with a smaller example, and break it down in steps: `d = data.table(x = c(2, 1, 4, 3), g = c("a", "a", "b", "b"))`; Use `.I` to get row names: `d[ , .I]`; `min` by group: `d[ , min(x), by = g]`; check if `x` is min: `d[ , x == min(x), by = g]` - note the `V1`; index `.I` with the previous boolean: `d[ , .I[x == min(x)], by = g]`, again the default name `V1`. — Henrik, Jul 17 '20 at 16:56
We could have actively named the result, e.g. `d[ , .(min_rows = .I[x == min(x)]), by = g]`, but we are lazy, and soon we will remember that the default name is `V1`. Extract the desired "min rows" as a vector: `d[ , .I[x == min(x)], by = g]$V1`. Finally, use the index vector to subset the original data in `i` (the first slot): `d[d[ , .I[x == min(x)], by = g]$V1]`. — Henrik, Jul 17 '20 at 17:01
"why `DT [bucket == min (bucket), .SD, by = title]`; it only returns the first group"; because you have `bucket == min (bucket)` in the `i` slot (the first slot), the `min` calculation is not performed by group, and you only select the rows which are equal to the _overall_ min. — Henrik, Jul 17 '20 at 17:10
Because you are new to `data.table` I can recommend to work through the examples in [the intro vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html), and in [this very nice tutorial](http://franknarf1.github.io/r-tutorial/_book/tables.html#tables). Good luck! — Henrik, Jul 17 '20 at 17:14

score 2 · Answer 1 · answered Jul 17 '20 at 10:34

2

Here is a solution with no chaining nor .SD:

# Convert from character to Date to be able to select the max
DT[, date := as.Date(date, "%d-%m-%Y")]

DT[,
   {
     mb <- which(bucket == min(bucket))
     .(
       date = max(date[mb]), bucket = bucket[mb][1L], avg_score = mean(score[mb])
     )
   }, 
   by = title]

#    title       date bucket avg_score
# 1:     a 2020-07-14      1        44
# 2:     b 2020-07-13      7        76
# 3:     c 2020-07-15      5        46
# 4:     d 2020-07-12      8        21

answered Jul 17 '20 at 10:34

s_baldur

29,441
4
36
69

sindri_baldur, Thanks for your code, it is easier and more readable than the one I show. I really like using the data.table package, and for the first time, in my little experience, I see a dplyr code more comfortable to read, to solve the task. Your code has made me happy! – HerClau Jul 17 '20 at 15:56
Following the idea of @sindri_baldur: ```DT [, {V1 <- which (bucket == min (bucket)); date = max (date [V1]); avg_score = mean (score [V1]); list (date = date, avg_score = avg_score)}, by = title]``` # 6 ```DT [, {V1 <- bucket == min (bucket); date = max (date [V1]); avg_score = mean (score [V1]); list (date = date, avg_score = avg_score)}, by = title]``` – HerClau Jul 18 '20 at 19:51

Group efficiently with data.table in R

1 Answers1