What is the recommended / "best" way (with respect to performance) to filter a data.table based on a criterion calculated on an aggregated form of this very table? A reprex speaks more than 1000 words:
library(data.table)
DT <- data.table(grp = rep(LETTERS[1:3], each = 3), x = 1:9)
setkey(DT, "grp")
# 1. aggregate to one flag per group, keep the groups where the flag is FALSE, join back
DT[DT[, .(nok = any(x == 4)), by = grp][nok == FALSE]]
# 2. return .GRP only for the groups that pass the filter (an empty j drops the group), join back
DT[DT[, .GRP[all(x != 4)], by = grp]]
I could think of these two solutions immediately, and my gut feeling tells me the second form should be "better" (smaller intermediate tables are stored and I do not need to chain the results), but I am wondering whether there is a canonical form of doing this.
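For completeness, a quick check (not essential to the question) that both forms keep the same rows; note that each join also carries its extra column from i (nok resp. V1) into the result, so I drop it before comparing:

r1 <- DT[DT[, .(nok = any(x == 4)), by = grp][nok == FALSE]][, nok := NULL]
r2 <- DT[DT[, .GRP[all(x != 4)], by = grp]][, V1 := NULL]
fsetequal(r1, r2)  # TRUE: groups A and C survive, group B (which contains 4) is dropped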
Maybe I do not need to use a join in the first place and can use a grouped filter for the i argument? This obviously does not work as intended (by apparently only has an effect on j):
DT[all(x != 4), by = grp]
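One workaround along these lines that I can think of (whether it actually performs better is part of my question) is to collect the surviving row indices per group with .I and then subset with them:

idx <- DT[, .I[all(x != 4)], by = grp]$V1  # row numbers of all groups that contain no 4
DT[idx]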
While this SO answer shows yet another way of doing the same thing, my main concern is performance. Thus, I would like to know which of these options generally scales well to large tables, given that I want to work further with the filtered data.table (that is, apply another j expression to the filtered result).
In my real-world scenario, I have about 16 million rows, with about 40k unique keys and 14 columns. Thus a benchmark data set could look as follows:
bench <- data.table(keys = rep(paste0("k", 1:40000), 400))
bench[, paste0("cols", 1:13) := replicate(13, sample(40000 * 400, replace = TRUE),
                                          simplify = FALSE)]
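A rough timing sketch on this data could then look like the following (microbenchmark is simply what I would reach for here, and the cols1 != 1L condition is just a stand-in for the real filter):

library(microbenchmark)
setkey(bench, keys)
microbenchmark(
  join = bench[bench[, .(nok = any(cols1 == 1L)), by = keys][nok == FALSE]],
  grp  = bench[bench[, .GRP[all(cols1 != 1L)], by = keys]],
  idx  = bench[bench[, .I[all(cols1 != 1L)], by = keys]$V1],
  times = 10L
)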
While I am looking for a generic answer (if possible) irrespective of the final filter selected, the actual filter will be to find out which groups contain any NA
value.
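For concreteness, with the .I pattern from above the actual filter would look roughly like this (bench as built here contains no NAs, so this only illustrates the shape of the call; the names na_free and filtered are arbitrary):

# keep only the groups that have no NA in any non-grouping column
na_free  <- bench[, .I[!any(vapply(.SD, anyNA, logical(1L)))], by = keys]$V1
filtered <- bench[na_free]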