Based on this previous post, I built leftOuterJoin, a function that updates a data.table X according to another data.table Y. The function is defined as follows:

leftOuterJoin <- function(X, Y, onCol) {
    .colsY <- names(Y)
    X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
}

The function works 99% of the time as intended, e.g.:

X <- data.table(id = 1:5, L = letters[1:5])
   id L
1:  1 a
2:  2 b
3:  3 c
4:  4 d
5:  5 e

Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
   id    L  N
1:  3 <NA> 10
2:  4    g NA
3:  5    h 12

leftOuterJoin(X, Y, "id")
X
   id    L  N
1:  1    a NA
2:  2    b NA
3:  3 <NA> 10
4:  4    g NA
5:  5    h 12

However, for a reason unknown to me, it stops working with some data tables (I have no reproducible example at hand). There is no error, but the data table is not updated. When I step through the function with debug(), everything seems to work: X is updated inside the function, but the data.table in the calling environment is not. If I run the same join outside the function, it works. Maybe it is related to the scope of the function? I am really struggling with this problem.

Spec: R v3.5.1 and data.table v1.11.4.

EDIT
Based on the comments I figured out that the problem is related to the data.table pointer. You can reproduce the problem with this code:

> save(X, file = "X.RData")
> load("X.RData")
> leftOuterJoin(X, Y, "id")
> X
   id    L
1:  1    a
2:  2    b
3:  3 <NA>
4:  4    g
5:  5    h

Notice that X is updated, but not the way we want: the existing column L is changed, but the new column N is not added. However, if we use setDT() it works properly:

> load("X.RData")
> setDT(X)
> leftOuterJoin(X, Y, "id")
> X
   id    L  N
1:  1    a NA
2:  2    b NA
3:  3 <NA> 10
4:  4    g NA
5:  5    h 12

Is there a way to set up leftOuterJoin() such that it will not be necessary to run setDT() every time some data is loaded?

mat
  • Without a reproducible example it is hard to answer. Just one thing: the function also updates the join column `onCol`. This is unnecessary and might be unsafe as well. – Uwe Aug 16 '18 at 08:07
  • See also the comments on [this answer](https://stackoverflow.com/a/42539526/3817004) on the question [data.table replace data using values from another data.table, conditionally](https://stackoverflow.com/q/42537520/3817004). – Uwe Aug 16 '18 at 08:13
  • Looks like a dupe of https://stackoverflow.com/questions/28078640/adding-new-columns-to-a-data-table-by-reference-within-a-function-not-always-wor If your data.table isn't set up to hold at least ncol(X) + ncol(Y) - length(onCol) columns before being passed to the function, the by-reference changes won't propagate outside the function. Maybe I'm wrong, though, since you said there was no error (and also no warning?) – Frank Aug 16 '18 at 14:05
  • @Frank That's correct, there were no warnings or errors. I'm trying to create an example to reproduce the problem, but my best guess right now is that it has to do with the attributes `attr(DT)` of the data.tables. – mat Aug 16 '18 at 14:45
  • @Frank thank you for the reference, I updated my post – mat Aug 16 '18 at 18:36
  • Ok. Re the updated question, I guess the FAQ answers it in the negative..? See "Reading data.table from RDS or RData file" after typing `vignette("datatable-faq")`. Maybe you can add a line inside leftOuterJoin like `if (truelength(X) == 0) stop("setDT on X first")` ... seems to be the only "solution" unless an enhancement is made to the package https://github.com/Rdatatable/data.table/issues/1017 – Frank Aug 16 '18 at 19:17
  • @Frank would you mind posting a solution? I'm not too sure how to deal with it. – mat Aug 17 '18 at 07:28
  • Hm, still seems like Arun's answer in the link and the FAQ is as close as I could get, so I'd be inclined to close it as a dupe of the question he answered. – Frank Aug 17 '18 at 11:48
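
The two suggestions from the comments (Frank's truelength guard and Uwe's remark about not overwriting the join column) can be combined into a minimal sketch. The name leftOuterJoinChecked and the error message are hypothetical, not part of data.table. The sketch assumes, as the FAQ entry cited above describes, that a data.table restored by load() reports a truelength of 0 until setDT() is called on it.

```r
library(data.table)

# Hypothetical variant of leftOuterJoin(); the name and message are illustrative.
leftOuterJoinChecked <- function(X, Y, onCol) {
    # Assumption: a data.table restored from disk has no over-allocated
    # column slots (truelength 0), so new columns cannot be added by
    # reference. Fail loudly instead of updating X only partially.
    if (truelength(X) == 0L) stop("run setDT() on X before joining")
    # Leave out the join column(s): they already match in every joined row.
    .colsY <- setdiff(names(Y), onCol)
    X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
    invisible(X)
}

# Usage with the tables from the question
X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
leftOuterJoinChecked(X, Y, "id")
print(X)
```

The guard does not remove the need for setDT(); it only turns the silent partial update into an explicit error. Because the function returns the table, the caller could also write `X <- leftOuterJoinChecked(X, Y, "id")`, which should keep the result even when the update could not be made fully by reference, though that has not been verified here against data.table v1.11.4.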

0 Answers