
What is the recommended / "best" way (with respect to performance) to filter a data.table based on some criteria calculated on an aggregated form of this very table?

A reprex speaks more than 1000 words:

library(data.table)
DT <- data.table(grp = rep(LETTERS[1:3], each = 3), x = 1:9)
setkey(DT, "grp")
# option 1: aggregate, filter the aggregated table, then join back
DT[DT[, .(nok = any(x == 4)), by = grp][nok == FALSE]]
# option 2: keep only the group index (.GRP) of passing groups, then join
DT[DT[, .GRP[all(x != 4)], by = grp]]

I could immediately think of these two solutions, and my gut feeling tells me the second form should be "better" (smaller intermediate tables are stored and I do not need to chain the results), but I was wondering whether there is a canonical way of doing this?
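
To compare the two forms directly, one could of course benchmark them, e.g. with the microbenchmark package (a sketch only; the labels are mine, and on the toy reprex the difference is obviously negligible):

microbenchmark::microbenchmark(
    flag_join = DT[DT[, .(nok = any(x == 4)), by = grp][nok == FALSE]],
    grp_join  = DT[DT[, .GRP[all(x != 4)], by = grp]]
)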

Maybe I do not need to use a join in the first place and can use a grouped filter for the i argument?

This obviously does not work as intended (`by` apparently only has an effect on `j`):

DT[all(x != 4), by = grp]
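
A grouped filter can, however, at least be emulated in `j` by subsetting `.SD` per group (shown only for comparison; it avoids the explicit self-join but is, as far as I know, not necessarily faster than the variants above):

DT[, .SD[all(x != 4)], by = grp]   # or: DT[, if (all(x != 4)) .SD, by = grp]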

While this SO answer shows yet another way of doing the same thing, my main concern is performance. Thus, I would like to know which of these options generally scales well to large tables, given that I want to keep working with the filtered data.table (that is, apply another `j` expression to the filtered result).

In my real-case scenario, I have about 16 million rows, with about 40k unique keys and 14 columns.

Thus a benchmark data set could look as follows:

bench <- data.table(keys = rep(paste0("k", 1:40000), 400))
bench[, paste0("cols", 1:13) := replicate(13, sample(40000 * 400, replace = TRUE),
                                          simplify = FALSE)]

While I am looking for a generic answer (if possible) irrespective of the final filter selected, the actual filter will be to find out which groups contain any NA value.
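
As constructed, bench contains no missing values, so to actually exercise the NA filter one would need to sprinkle some in first; a purely illustrative sketch (the number of NAs and the affected column are arbitrary):

set.seed(1)                                         # arbitrary seed, for reproducibility
bench[sample(nrow(bench), 10000L), cols1 := NA_integer_]  # inject some NAs into cols1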

thothal
  • Does this answer your question? [filter rows in data.table with `by`](https://stackoverflow.com/questions/34393053/filter-rows-in-data-table-with-by) – arg0naut91 Apr 23 '20 at 11:01
  • Partly, yes, it does the filtering, but it does not answer which of the many options shows the highest performance. – thothal Apr 23 '20 at 11:10
  • Then I'd update the question with some actual benchmarks and data suitable for comparing. Currently "performance" is mentioned nowhere. – arg0naut91 Apr 23 '20 at 11:11
  • Hm, actually it does, in the first line: "What is the recommended / "best" way (with respect to performance)", but I added an update to make it even clearer. – thothal Apr 23 '20 at 11:13
  • I take that back; however, it wasn't really clear from the overall post. It would also be great if you could provide a representative dataset - "performance" can be quite relative given how many columns / rows / groups you have and different variations ... – arg0naut91 Apr 23 '20 at 11:17
  • Added a benchmark data set and clarified the dimensions of my real case scenario. – thothal Apr 23 '20 at 11:28

2 Answers


I learned from this post

You can do this:

DT[DT[, .I[all(x != 4)], by = .(grp)]$V1]  # .I gives the row numbers within DT; a group's rows are kept only if it contains no 4
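
The same pattern could also be applied to the NA filter from the question, for example (a sketch, assuming the bench table from the question with grouping column keys and value column cols1):

bench[bench[, .I[anyNA(cols1)], by = keys]$V1]  # all rows of groups that have an NA in cols1; negate the condition to drop them instead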
Frank Zhang

I do not think there is a generic way that applies to all situations; it depends on the characteristics of the dataset and also on the filtering criteria. For example, the dataset might have many small groups or integer keys, or the filter might be something that can be implemented with a faster low-level approach.

Here are a few options with regard to your actual problem (i.e. filtering for groups with NAs in one of the columns):

DT_keys <- copy(DT)
system.time(setkey(DT_keys, keys))    # key on the grouping column (used by m0 and m1)
#   user  system elapsed 
#   1.50    0.67    1.32 

DT_cols1 <- copy(DT)
system.time(setkey(DT_cols1, cols1))  # key on the column to be searched for NAs (used by m2)
#   user  system elapsed 
#   4.21    0.21    1.30 

microbenchmark::microbenchmark(times = 1L,
    # m0: per group, return the key once for every NA row in cols1
    m0 = DT_keys[, keys[is.na(cols1)], keys]$keys,
    # m1: per group, return the keys only if the group contains an NA in cols1
    m1 = DT_keys[, if (anyNA(cols1)) keys, keys]$keys,
    # m2: keyed binary search for NA on cols1
    m2 = DT_cols1[.(NA_integer_)]$keys
)

Timings for the 16 million rows of dummy data:

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
   m0 90.675005 90.675005 90.675005 90.675005 90.675005 90.675005     1
   m1 56.548620 56.548620 56.548620 56.548620 56.548620 56.548620     1
   m2  4.010301  4.010301  4.010301  4.010301  4.010301  4.010301     1

The timings are very fast for a dataset of your actual size; there is not much significant time to be saved unless you run the filtering hundreds of times. Perhaps there is something other than runtime (e.g. memory usage) worth optimizing for.

data:

library(data.table)
set.seed(0L)
nk <- 4e4L
nn <- 400L
DT <- data.table(keys = rep(paste0("k", 1L:nk), nn))
DT[, paste0("cols", 1L:13L) := 
    replicate(13L, sample(c(NA_integer_, 1L:nk), nk * nn, TRUE), simplify = FALSE)]
chinsoon12
  • Thanks for your elaborate answer, +1. For my full understanding: are you suggesting that the "best" way to finally filter the original data set is to use a join of this aggregated table with the original one, `DT[DT_cols1[.(NA_integer_)]$keys]`? – thothal Apr 24 '20 at 07:41
  • Will you be running the filtering many times? Will you be facing any memory issues if we dupe the dataset twice? – chinsoon12 Apr 24 '20 at 07:52
  • Actually, the filtering will be run only once. Memory could start to be an issue, so maybe the solution where we use the same key is preferable. Thanks, your answer helped me a lot to get more comfortable with `data.table` (being more of a `dplyr` guy myself). – thothal Apr 24 '20 at 08:07
  • Then I would suggest `setkey(DT, keys)` once and then use `DT[DT[, if (anyNA(cols1)) keys, keys]$keys]`. – chinsoon12 Apr 24 '20 at 08:08
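
Spelling that last suggestion out as a minimal, illustrative sketch (assuming the filtering is run only once and DT is the 16-million-row table from the benchmark data above):

setkey(DT, keys)                                                 # key on the grouping column, reused afterwards
na_keys <- unique(DT[, if (anyNA(cols1)) keys, by = keys]$keys)  # keys of groups with an NA in cols1; unique() drops the per-row repeats
DT[na_keys]    # all rows belonging to those groups
# DT[!na_keys] # or drop those groups instead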