27

I have two data.tables, X (3m rows by ~500 columns), and Y (100 rows by two columns).

set.seed(1)
X <- data.table( a=letters, b=letters, c=letters, g=sample(c(1:5,7),length(letters),replace=TRUE), key="g" )
Y <- data.table( z=runif(6), g=1:6, key="g" )

I want to do a left outer join on X, which I can do by Y[X] thanks to:

Why does X[Y] join of data.tables not allow a full outer join, or a left join?

But I want to add the new column to X without copying X (since it's huge).

Obviously, something like X <- Y[X] works, but unless data.table is far cleverer than I give it credit for (and I give it credit for quite a lot of deviousness!), I believe this copies the whole of X.

X[ , z:= Y[X,z]$z ] works, but is kludgy and doesn't scale well to more than one column.

How do I store the results of a merge back into the retained data.table in an efficient (both in terms of copies and in terms of programmer time) way?

smci
  • 32,567
  • 20
  • 113
  • 146
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
  • 1
    you don't need to do `Y[X,z]` (and will possibly run into problems doing that if you forget about by-without-by), just `X[, z := Y[X]$z]` works and seems to be faster for this example; although ultimately `X = Y[X]` is by far the fastest of the different expressions I've tried so far – eddi Oct 23 '13 at 22:19
  • Interesting. I had the `,z` in there because I thought that would give DT info about what variables it needed to retain since it optimizes on that. But your (deleted) point is worth copying here: "watch out for hidden by-without-by when doing smth like `Y[X,z]`." Even if it's fast, if `X = Y[X]` creates a copy I'm potentially in trouble.... – Ari B. Friedman Oct 23 '13 at 22:29
  • I see, so your concern is more about memory usage and less speed; in that case I think what you suggested, in the form that you suggested, by-without-by's notwithstanding (ok here I think - I always get confused about it), is probably the way to go – eddi Oct 23 '13 at 22:34
  • Do I have to worry about by if I do `Y[X,list(z)]` instead? – Ari B. Friedman Oct 23 '13 at 22:40
  • No, you'll only need to worry if you do anything other than list all elements, or more specifically, when the operation is different when done by group – eddi Oct 23 '13 at 22:42
  • Got it. Still leaves `X[, z := Y[X]$z]` still leaves me with the problem of doing it for many variables, unfortunately. – Ari B. Friedman Oct 23 '13 at 22:51
  • 3
    Related and also answered by eddi albeit without anyone finding the documentation: http://stackoverflow.com/questions/16843728/accessing-y-columns-with-duplicated-names-in-j-of-xy-j-merges – Frank Oct 26 '13 at 09:01

2 Answers2

33

This is easy to do:

X[Y, z := i.z]

It works because the only difference between Y[X] and X[Y] here, is when some elements are not in Y, in which case presumably you'd want z to be NA, which the above assignment will exactly do.

It would also work just as well for many variables:

X[Y, `:=`(z1 = i.z1, z2 = i.z2, ...)]

Since you require the operation Y[X], you can add the argument nomatch=0 (as @mnel points out) so as to not get NAs for those where X doesn't contain the key values from Y. That is:

X[Y, z := i.z, nomatch=0]

From the NEWS for data.table

    **********************************************
    **                                          **
    **   CHANGES IN DATA.TABLE VERSION 1.7.10   **
    **                                          **
    **********************************************

NEW FEATURES

o   The prefix i. can now be used in j to refer to join inherited
    columns of i that are otherwise masked by columns in x with
    the same name.
Arun
  • 116,683
  • 26
  • 284
  • 387
eddi
  • 49,088
  • 6
  • 104
  • 155
  • 4
    Many thanks. That works (even with my edited example data), but I don't understand why. Is there a name for the `i.` notation? I don't see it anywhere in `?[.data.table`.... – Ari B. Friedman Oct 23 '13 at 22:59
  • 3
    I think the `i.` usage is poorly documented and I forget where it is right now - but it's simply used to refer to columns of the `i-expression` data.table – eddi Oct 23 '13 at 23:00
  • 1
    I get it. Still a little confused as to why `X[Y]` (right outer join) produces such different results from `X[Y, z:= i.z]` (left outer join with assignment). – Ari B. Friedman Oct 23 '13 at 23:02
  • they both produce the same result as far as the join goes, and in the second case, the resulting `z` from `Y` is assigned to a new column in `X` (which automatically produces `NA`'s for cases where there was no match) – eddi Oct 23 '13 at 23:04
  • Ah I get it. Very clever. Gist of me testing it is here: https://gist.github.com/gsk3/7128419 – Ari B. Friedman Oct 23 '13 at 23:06
  • 2
    @AriB.Friedman -- it is in the NEWS for version 1.7.10 (I've added a link). For the differences, it comes down to the fact you can't add rows by reference (yet), so `X[Y, z:=i.z]` will be the equivalent of setting `z=i.z` X[Y, z:=i.z, nomatch=0]` – mnel Oct 23 '13 at 23:08
  • `nomatch=0` is not meaningful during an assignment, as indicated by the warning shown when running that line of code. – Frank Mar 03 '17 at 21:02
9

As an addition to the answer above, you can also do (v1.9.6+):

require(data.table) # v1.9.6+
X[Y, (colNames) := mget(paste0("i.", colNames))]

where colNames is a character vector listing the columns you want from Y. This lets you efficiently select columns to add (define colNames from a subset of names(Y)) in the case you are adding many columns.

Also, you can combine it with the new on= argument (from v1.9.6+) as:

# ad-hoc joins using 'on=' instead of setting keys
require(data.table) # v1.9.6+
X[Y, (colNames) := mget(paste0("i.", colNames)), on = "g"]

Credit to akrun for the (colNames) := mget(colNames) strategy here: Update rows of data frame in R.

Community
  • 1
  • 1
Alexander Li
  • 307
  • 1
  • 3
  • 9
  • 2
    Actually not akrun, rather the guy above you introduced this here: https://stackoverflow.com/a/30469832/3001626 . So basically both of the answers on this thread are eddi's – David Arenburg Jul 10 '17 at 07:00