vctrs::vec_chop method to replace filter(row_number() == 1)

Question

I'm trying to speed up my code using the tips outlined in a recent tidyverse blog post (https://www.tidyverse.org/blog/2023/04/performant-packages/). I've managed to replace some of my simpler filter and mutate calls and got slightly speedier code. However, there is one section where I can't figure out how to do this, and I would love some guidance if at all possible.

The blog post mentions vec_chop, list_unchop and vec_rep_each, but I haven't managed to figure out how to use the indices to do this on a small dataset, let alone a large one. This is the code I'd like to replace:

df2 <- df1 %>% 
  group_by(clnt_label, term1, term2) %>% 
  filter(row_number() == 1)

Dummy data:

df1 <- tibble(clnt_label = rep(LETTERS[1:3], each = 10),
              term1 = rev(rep(LETTERS[1:3], times = 10)),
              term2 = rep(LETTERS[1:3], each = 5, times = 2))

Any thoughts/advice would be appreciated!

EDIT:

I tried the solution suggested by Axeman and an idea of my own. Testing on my full dataset rather than the dummy data, distinct() was by far the fastest of the approaches I've tried so far (each was timed with a single run).

microbenchmark::microbenchmark(
  all_pairs4_old <- all_pairs3 %>% 
    group_by(clnt_label, term1, term2) %>% 
    filter(row_number() == 1),
  times = 1, unit = "millisecond")
# Unit: milliseconds
#      min       lq     mean   median neval
# 31938.37 31938.37 31938.37 31938.37     1

microbenchmark::microbenchmark(
  all_pairs4_head <- all_pairs3 %>% 
    group_by(clnt_label, term1, term2) %>% 
    slice_head(n = 1),
  times = 1, unit = "millisecond")
# Unit: milliseconds
#      min       lq     mean   median neval
# 214474.4 214474.4 214474.4 214474.4     1

microbenchmark::microbenchmark(
  all_pairs4_slice <- all_pairs3 %>% 
    group_by(clnt_label, term1, term2) %>% 
    slice(1),
  times = 1, unit = "millisecond")
# Unit: milliseconds
#      min       lq     mean   median neval
# 144225.7 144225.7 144225.7 144225.7     1

microbenchmark::microbenchmark(
  all_pairs4_distinct <- all_pairs3 %>% 
    distinct(clnt_label, term1, term2, .keep_all = TRUE),
  times = 1, unit = "millisecond")
# Unit: milliseconds
#      min       lq     mean   median neval
# 242.9775 242.9775 242.9775 242.9775     1
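
For the vctrs route asked about originally, here is a minimal sketch on the dummy data. It has not been benchmarked on the full dataset. It assumes the goal is just the first row of each clnt_label/term1/term2 combination, in original row order, and that an ungrouped result is acceptable. The names key_cols, keys, first_loc, groups, pieces and df2_chop are made up for illustration. vec_unique_loc() gives the row locations directly; vec_group_loc() plus vec_chop() shows how the indices fit together if the per-group pieces are needed for something else.

```r
library(vctrs)

# Dummy data from the question, as a plain data frame
df1 <- data.frame(
  clnt_label = rep(LETTERS[1:3], each = 10),
  term1 = rev(rep(LETTERS[1:3], times = 10)),
  term2 = rep(LETTERS[1:3], each = 5, times = 2)
)

key_cols <- c("clnt_label", "term1", "term2")
keys <- df1[key_cols]

# Direct route: row location of the first occurrence of each key combination
first_loc <- vec_unique_loc(keys)
df2 <- vec_slice(df1, first_loc)

# Index route: vec_group_loc() returns one integer vector of row locations
# per group, with groups ordered by first appearance
groups <- vec_group_loc(keys)
first_loc_by_group <- vapply(groups$loc, function(loc) loc[[1]], integer(1))

# vec_chop() splits df1 into one data frame per group using those indices;
# taking row 1 of each piece and binding them gives the same rows again
pieces <- vec_chop(df1, indices = groups$loc)
df2_chop <- vec_rbind(!!!lapply(pieces, vec_slice, i = 1L))
```

If only the first rows are needed, the vec_unique_loc() version avoids building the per-group pieces at all, which is probably why it is the cheaper of the two.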

`filter`ing by index should usually be done by slice, so I'd simply use `slice(1)` which should also be more performant. If you really need speed (e.g. because you have many groups), you can gain a lot of performance by using `data.table`, optionally with `dtplyr`. — Axeman, Apr 27 '23 at 21:36
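
A minimal sketch of the data.table suggestion from the comment, again on the dummy data and not timed against the real dataset. It assumes that keeping the first row per key combination is all that is needed; df2_dt is a made-up name. unique() with the by argument keeps the first row for each combination of the listed columns.

```r
library(data.table)

# Dummy data from the question, as a data.table
df1 <- data.table(
  clnt_label = rep(LETTERS[1:3], each = 10),
  term1 = rev(rep(LETTERS[1:3], times = 10)),
  term2 = rep(LETTERS[1:3], each = 5, times = 2)
)

# Keep the first row for every clnt_label/term1/term2 combination
df2_dt <- unique(df1, by = c("clnt_label", "term1", "term2"))
```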
