4

I created some code where I need to do a lot of joins in a loop with about 1M rows. At first I was making shallow copies with a left join, but I changed the code to make update join and it became slow.

Some test data

a=data.table(id=1:10^6)
b=data.table(id=1:10^6, t='y')

cols=setdiff(names(b),names(a))
c=copy(b[a,on=.(id)])
a[b,(cols):=mget(cols),on=.(id)]
all.equal(a,c)
[1]TRUE

I don't know if this is the correct way to compare times, but here it is

shallow copy

microbenchmark(c=b[a,on=.(id)])
Unit: milliseconds
expr      min      lq     mean   median       uq      max neval
c 120.5911 122.734 128.1071 124.0922 126.1449 179.0127   100

updating the data.table

microbenchmark(a[b,(cols):=mget(cols),on=.(id)])
Unit: milliseconds
expr      min       lq     mean   median       uq      max
a[b, `:=`((cols), mget(cols)), on = .(id)] 159.6128 162.9798 175.9855 168.5807 184.8756 308.5406
neval
100

altrough I will need a solution to update multiple columns, don't know why but, using mget is slower than below approach

microbenchmark(a[b, t:=i.t, on=.(id)])
Unit: milliseconds
expr      min       lq     mean   median      uq      max neval
a[b, `:=`(t, i.t), on = .(id)] 138.0173 140.4671 146.3558 141.6204 143.281 185.1488   100

In this case I'm not concerned about memory usage just speed. Is there a way to update the original table without copies faster than the copy method?

R. Cowboy
  • 223
  • 1
  • 7
  • 2
    Don't use mget as it may be subefficient, use env var instead – jangorecki Aug 12 '21 at 07:53
  • @jangorecki I will be studying about environments, but could you provide some example how to use it? I using it inside a function where the names in `cols`change based in the values provided by the user. Another doubt that I had right now with `mget` and don't know if you can help me, but if I use it inside a function and have the same object name outside the function it will always access the object created inside the function? I did some test and it's whats happened, but reading the help it's not clear. – R. Cowboy Aug 12 '21 at 11:49
  • 1
    @jangorecki, while I'm familiar with envvars (`..varname` notation for up-one-level access) and newly informed on the wip `env=list(...)` argument, I cannot get something like `dfa[dfb, c(cols) := .(..cols), on = .(id)]` to work correctly. I'm guessing it's a subtle problem I'm not seeing, but I think it would be really helpful if you could expand on your first comment. Thanks! (R.Cowboy, I renamed the vars to `dfa` and `dfb`, since I really dislike having an object named `c` in R, so I renamed all three.) – r2evans Aug 12 '21 at 12:47
  • (If it matters, the expression in my previous comment functions without error but assigns a new column `t` (correct) with the literal value `"t"` (incorrect) instead of the *contents* of `dfb[,cols,with=FALSE]`.) – r2evans Aug 12 '21 at 12:49
  • @r2evans the way you suggested without `mget`seems to work with a single column in `dfb` but when I create another one and try to join this way it gives an error ... don't know if I need to do some adjust – R. Cowboy Aug 12 '21 at 14:37
  • AFAICT documentation and examples covers that use case very well. Look for env=list(.rhs=as.list(cols)). Also use verbose=TRUE for debugging env argument. – jangorecki Aug 12 '21 at 16:09
  • @Jangoreski, do you confirm that new `env` parameter is only in [Unreleased data.table v1.14.1 (in development)](https://rdatatable.gitlab.io/data.table/news/index.html) – Waldi Aug 13 '21 at 07:27
  • @jangorecki, while I am interested to play with `env=`, I am not comfortable working with a WIP package *in production* without significant testing on my own packages and code; this is not a reflection on `data.table`, just my process of reproducible processing/reports (console, `plumber`, `shiny`, and kafka-streaming) and an aversion to update any package arbitrarily. I think this use-case is relevant enough to determine ***if*** it can be done pre-`env=list(..)`; the answer may be "no". You are among few I would trust whole-heartedly to find a good method if one exists. Thanks! – r2evans Aug 13 '21 at 12:22
  • 1
    See also https://stackoverflow.com/questions/60263274/should-i-use-mget-or-with-false-to-select-columns-of-a-data-table. `mget` is slow because internal code needs to allocate memory to the `.SD` variable. I will post a solution that is a less user friendly way to do `env = list(...)`. – Cole Aug 13 '21 at 23:26
  • 1
    env arg is just friendly way to use base R substitute, @Cole answer is great! – jangorecki Aug 14 '21 at 19:29

1 Answers1

4

Here is a language way to approach it. Basically, is awesome and does its best to optimize. However, other programmatic trickery like mget makes data.table less efficient as it has to assume that the j expression could include any column and therefore it needs to allocate memory to evaluate.

This translates mget into what is actually happening using language: a[b, (cols) = list(t)].

nms = lapply(cols, as.name)
eval(substitute(a[b, (cols):= .x, on = .(id)], list(.x = as.call(c(quote(list), nms)))))

## performance against mget()

bench::mark(lang = {
  nms = lapply(cols, as.name)
  eval(substitute(a[b, (cols):= .x, on = .(id),], list(.x = as.call(c(quote(list), nms)))))
}
, mget_approach = a[b,(cols):=mget(cols),on=.(id)]
, normal =  b2[a2,on=.(id)]
  )

# A tibble: 3 x 13
  expression        min  median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>    <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int>
1 lang            157ms   160ms      5.97      23MB     1.99     3
2 mget_approach   244ms   255ms      3.92    57.3MB     5.87     2
3 normal          162ms   174ms      5.82    34.4MB     3.88     3

## additional data to make above work: a2 = copy(a); b2 = copy(b)

Here is the source code of where mget is detected and needs to account for it.

https://github.com/Rdatatable/data.table/blob/94a12475f737892c542d3cb7daf42e534ea13a22/R/data.table.R#L1044-L1049

    # added 'mget' - fix for #994
    if (any(c("get", "mget") %chin% av)){
      if (verbose)
        catf("'(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single j=eval(macro) instead. Both will detect the columns used which is important for efficiency.\nOld ansvars: %s \n", brackify(ansvars))
        # get('varname') is too difficult to detect which columns are used in general
        # eval(macro) column names are detected via the  if jsub[[1]]==eval switch earlier above.
Cole
  • 11,130
  • 1
  • 9
  • 24