I want to keep only the groups with more than 20 observations in my dataset (8.5M rows). I intend to translate most of my code from dplyr to data.table for the sake of efficiency.
This is the operation I have chosen for data.table (I am not too familiar with the syntax yet, so there may well be a better way to write it):
df <- df[, .SD[.N > 20], by = cols]
And this is the dplyr equivalent that I was using previously:
df <- df %>%
  group_by(across(all_of(cols))) %>%
  filter(n() > 20) %>%
  ungroup()
The data.table version takes 7 times longer than the dplyr one, even though df was already a data.table before any of the operations.
Why does this happen, and how could I rewrite the data.table code to make it faster?