
I want to keep only the groups with more than 20 observations per group in my dataset (8.5M rows). I intend to translate most of my code from dplyr to data.table for efficiency. This is the operation I have chosen in data.table (I am not too familiar with the syntax yet, so there may well be a better way to write it):

df <- df[, .SD[.N > 20], by = cols]

And this is the equivalent in dplyr that I was using previously:

df <- df %>% 
  group_by(across(all_of(cols))) %>%
  filter(n() > 20) %>%
  ungroup()

The data.table version takes about 7 times longer than the dplyr one, even though df was already a data.table before either operation.

Why does this happen, and how could I rewrite the data.table code to make it faster?

  • Try `df[, .N, by = cols][N > 20]` – Jamie Mar 08 '23 at 16:17
  • As for why, see [here](https://stackoverflow.com/questions/16573995/subset-rows-corresponding-to-max-value-by-group-using-data-table) – Henrik Mar 08 '23 at 20:44
  • @Jamie, the operation is fast, but it doesn't produce the output I want: the resulting data.table only has the columns specified in `cols` plus N. I simply want to keep those groups and get back my original table with all of its columns, as in the first code snippet of the question (see the sketch after these comments). – Alberto Agudo Dominguez Mar 10 '23 at 12:28
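
A common way to get the full rows back at data.table speed is to compute the qualifying row indices with `.I` and subset the original table once. Below is a minimal sketch of that idiom; the example table and the `cols` vector are made-up placeholders for the real data:

library(data.table)

## hypothetical stand-ins for the real 8.5M-row table and grouping columns
set.seed(1)
df <- data.table(g1 = sample(letters, 1e5, replace = TRUE),
                 g2 = sample(1:50,   1e5, replace = TRUE),
                 x  = rnorm(1e5))
cols <- c("g1", "g2")

## .I holds the original row numbers within each group; keeping them only
## when .N > 20 collects the indices of all rows in sufficiently large groups
idx <- df[, .I[.N > 20], by = cols]$V1

## a single plain row subset then returns every column of the original table
df <- df[idx]

Unlike `.SD[.N > 20]`, which materialises and subsets `.SD` once per group, this evaluates only the cheap `.I[.N > 20]` expression per group and performs one vectorised subset at the end, which is typically much faster on millions of rows.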
