
I want to keep only the groups with more than 20 observations per group in my dataset (8.5M rows). I intend to translate most of my code from dplyr to data.table for efficiency. This is the operation I have chosen in data.table (I am not too familiar with the syntax yet, so there may well be a better way to write it):

df <- df[, .SD[.N > 20], by = cols]

And this is the equivalent in dplyr that I was using previously:

df <- df %>% 
  group_by(across(all_of(cols))) %>%
  filter(n() > 20) %>%
  ungroup()

The data.table version takes about 7 times longer than the dplyr one, even though df was already a data.table before either operation.

Why does this happen, and how could I rewrite the data.table code to make it faster?

  • Try `df[, .N, by = cols][N > 20]` – Jamie Mar 08 '23 at 16:17
  • As for why, see [here](https://stackoverflow.com/questions/16573995/subset-rows-corresponding-to-max-value-by-group-using-data-table) – Henrik Mar 08 '23 at 20:44
  • @Jamie, the operation is fast, but it doesn't produce the output I want: the resulting data.table only has the columns specified in `cols` plus N. I simply want to keep those groups and get back my original table with all of its columns, as in the first code snippet of the question (see the sketch after these comments). – Alberto Agudo Dominguez Mar 10 '23 at 12:28
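
A common way to get the full rows back at data.table speed is to compute the qualifying row indices with `.I` and subset the original table once. Below is a minimal sketch of that idiom; the example table and the `cols` vector are made-up placeholders for the real data:

library(data.table)

## hypothetical stand-ins for the real 8.5M-row table and grouping columns
set.seed(1)
df <- data.table(g1 = sample(letters, 1e5, replace = TRUE),
                 g2 = sample(1:50,   1e5, replace = TRUE),
                 x  = rnorm(1e5))
cols <- c("g1", "g2")

## .I holds the original row numbers within each group; keeping them only
## when .N > 20 collects the indices of all rows in sufficiently large groups
idx <- df[, .I[.N > 20], by = cols]$V1

## a single plain row subset then returns every column of the original table
df <- df[idx]

Unlike `.SD[.N > 20]`, which materialises and subsets `.SD` once per group, this evaluates only the cheap `.I[.N > 20]` expression per group and performs one vectorised subset at the end, which is typically much faster on millions of rows.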
