I'm trying to speed up my code using the tips outlined in this recent blog post: https://www.tidyverse.org/blog/2023/04/performant-packages/ . I've managed to replace some of my simpler filter() and mutate() calls with slightly faster code. However, there is one section where I can't figure out how to do this, and I would love some guidance.

The blog post mentions vec_chop(), list_unchop() and vec_rep_each(), but I haven't figured out how to use the indices to do this on a small dataset, let alone a large one. This is the code I want to speed up (it keeps the first row of each group):

library(dplyr)

# Keep only the first row of each (clnt_label, term1, term2) group
df2 <- df1 %>% 
  group_by(clnt_label, term1, term2) %>% 
  filter(row_number() == 1)

Dummy data:

df1 <- tibble(clnt_label = rep(LETTERS[1:3], each = 10),
              term1 = rev(rep(LETTERS[1:3], times = 10)),
              term2 = rep(LETTERS[1:3], each = 5, times = 2))

Any thoughts/advice would be appreciated!

EDIT:

I tried the solution suggested by Axeman (slice(1)) along with an idea of my own (distinct()). I tested these on my full dataset rather than the dummy data, with a single run each, and distinct() was the fastest of the approaches I've tried so far.

microbenchmark::microbenchmark(
  all_pairs4_old <- all_pairs3 %>% 
    group_by(clnt_label, term1, term2) %>% 
    filter(row_number() == 1),
  times = 1, unit = "millisecond")
# Unit: milliseconds
# min       lq     mean   median       neval
# 31938.37 31938.37 31938.37 31938.37   1

microbenchmark::microbenchmark(
  all_pairs4_head <- all_pairs3 %>% 
    group_by(clnt_label, term1, term2) %>% 
    slice_head(n = 1),
  times = 1, unit = "millisecond")  
# Unit: milliseconds
# min       lq       mean    median    neval
# 214474.4 214474.4 214474.4 214474.4  1

microbenchmark::microbenchmark(
  all_pairs4_slice <- all_pairs3 %>% 
    group_by(clnt_label, term1, term2) %>% 
    slice(1),
  times = 1, unit = "millisecond") 
# Unit: milliseconds
# min       lq       mean     median      neval
# 144225.7 144225.7 144225.7 144225.7     1

microbenchmark::microbenchmark(
  all_pairs4_distinct <- all_pairs3 %>% 
    distinct(clnt_label, term1, term2, .keep_all = TRUE),
  times = 1, unit = "millisecond") 
# Unit: milliseconds
# min       lq       mean     median   neval
# 242.9775 242.9775 242.9775 242.9775  1
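
For completeness, here is a minimal sketch of the index-based vctrs route the question asks about, run on the dummy data rather than the full dataset. It assumes the vctrs and tibble packages are installed; `keys`, `first_loc` and `df2_vctrs` are names introduced here for illustration, and it has not been timed against the approaches above.

```r
library(vctrs)
library(tibble)

# Dummy data from the question
df1 <- tibble(clnt_label = rep(LETTERS[1:3], each = 10),
              term1 = rev(rep(LETTERS[1:3], times = 10)),
              term2 = rep(LETTERS[1:3], each = 5, times = 2))

# Grouping columns only; rows of this data frame act as the group keys
keys <- df1[c("clnt_label", "term1", "term2")]

# Integer positions of the first occurrence of each distinct key
first_loc <- vec_unique_loc(keys)

# Slice the full data frame at those positions, keeping every column
df2_vctrs <- vec_slice(df1, first_loc)
```

Because `vec_slice()` is applied to the whole data frame, any extra columns are carried along, which mirrors `distinct(..., .keep_all = TRUE)`.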
Dave R
  • `filter`ing by index should usually be done by slice, so I'd simply use `slice(1)` which should also be more performant. If you really need speed (e.g. because you have many groups), you can gain a lot of performance by using `data.table`, optionally with `dtplyr`. – Axeman Apr 27 '23 at 21:36
  • Either use data.table or Rcpp if speed is an issue – Onyambu Apr 27 '23 at 22:44
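
The data.table suggestion in the comments could be sketched roughly as follows on the dummy data. This assumes the data.table package is installed; the `row_id` column is not part of the original data and is added here only to show that non-key columns are kept, and `dt1` and `dt2` are illustrative names. No speed comparison is implied.

```r
library(data.table)

# Dummy data from the question, plus a hypothetical row_id column
dt1 <- data.table(clnt_label = rep(LETTERS[1:3], each = 10),
                  term1 = rev(rep(LETTERS[1:3], times = 10)),
                  term2 = rep(LETTERS[1:3], each = 5, times = 2),
                  row_id = seq_len(30))

# Keep the first row for each combination of the key columns;
# columns not listed in `by` (here row_id) are retained
dt2 <- unique(dt1, by = c("clnt_label", "term1", "term2"))
```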
