Translating SQL joins on foreign keys to R data.table syntax

Question

The data.table package provides many of the same table handling methods as SQL. If a table has a key, that key consists of one or more columns. But a table can't have more than one key, because it can't be sorted in two different ways at the same time.

In this example, X and Y are data.tables with a single key column "id"; Y also has a non-key column "x_id".

   X <- data.table(id = 1:5, a=4:8,key="id")
   Y <- data.table(id = c(1,1, 3,5,7), x_id=c(1,4:1), key="id")

The following syntax would join the tables on their keys:

  X[Y]

How can I translate the following SQL syntax to data.table code?

  select * from X join Y on X.id = Y.x_id;

The closest that I have gotten is:

Y[X,list(id, x_id),by = x_id,nomatch=0]

However, this does not do the same inner join as the SQL statement.

Here is a more clear example in which the foreign key is y_id, and we want the join to look up values of Y2 where X2$y_id = Y2$id.

    X2 <- data.table(id = 1:5, y_id = c(1,1,2,2,2), key="id")
    Y2 <- data.table(id = 1:5, b = letters[1:5], key="id")

I would like to produce the table:

   id  y_id  b
    1     1 "a"
    2     1 "a"
    3     2 "b"
    4     2 "b"
    5     2 "b"

similar to what is done by the following kludge:

> merge(data.frame(X2), data.frame(Y2), by.x = "y_id", by.y = "id")
  y_id id b
1    1  1 a
2    1  2 a
3    2  3 b
4    2  4 b
5    2  5 b

However, when I do this:

    X2[Y2, 1:2,by = y_id]

I do not get the desired result:

    y_id V1
[1,]    1  1
[2,]    1  2
[3,]    2  1
[4,]    2  2

You can temporarily change `X2` keys and set them to `"y_id"`; then perform a normal join `Y2[X2]` (or `X2[Y2]` depending on the direction) and then restore the previous key of `X2`. — digEmAll, Mar 28 '12 at 20:56
@digEmAll that is useful, I thought that was what the `by` argument does... but can this be applied to a multiple (>2) table join where the target (left) table has >1 foreign key? — David LeBauer, Mar 28 '12 at 21:07
I agree with @digEmAll: `setkey(X2, y_id)` followed by `X2[Y2, nomatch=0]` is all you need for your example. And this should work with several keys as well. However, I'm not really familiar with the foreign key syntax in SQL, so if you are struggling with more keys, could you extend your example? — Christoph_J, Mar 29 '12 at 07:34
"The following syntax would join the tables on their keys: `X[Y]` -- I think the equivalent SQL syntax would be: `SELECT * FROM A NATURAL JOIN B;` — onedaywhen, Mar 29 '12 at 09:29
I edited the top of the question to clear up terminology. Hope ok. Will answer ... — Matt Dowle, Mar 29 '12 at 10:14
@MatthewDowle thank you for your help clarifying the terminology - I was getting stuck trying to translate primary/foreign keys to the keys used in `data.table`. Now if I understand correctly, the 'key' is to set the key for each table as required for a specific join. — David LeBauer, Mar 29 '12 at 14:37
@David No problem. Kind of, but you don't have to set the key of `i` at all (see first example in my answer showing an unkeyed `i`). Secondary keys are on the wish list which should help. People are doing "manual secondary keys" currently; search datatable-help for several threads on that. Also, perhaps `merge.data.table` should gain `by.x` and `by.y` arguments, but anything other than a join to `x`'s key will always be slower. For that reason some people hold two copies of `x` in RAM, with different keys. It depends on the data and task. See FAQs 3.1 - 3.3. — Matt Dowle, Mar 29 '12 at 15:55

Matt Dowle · Accepted Answer · 2012-03-29T21:10:35.910

Good question. Note the following (admittedly buried) in ?data.table :

When i is a data.table, x must have a key. i is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge of the two tables is carried out.

So, the key here is that i doesn't have to be keyed. Only x must be keyed.

X2 <- data.table(id = 11:15, y_id = c(14,14,11,12,12), key="id")
     id y_id
[1,] 11   14
[2,] 12   14
[3,] 13   11
[4,] 14   12
[5,] 15   12
Y2 <- data.table(id = 11:15, b = letters[1:5], key="id")
     id b
[1,] 11 a
[2,] 12 b
[3,] 13 c
[4,] 14 d
[5,] 15 e
Y2[J(X2$y_id)]  # binary search for each item of (unsorted and unkeyed) i
     id b
[1,] 14 d
[2,] 14 d
[3,] 11 a
[4,] 12 b
[5,] 12 b

or,

Y2[SJ(X2$y_id)]  # binary merge of keyed i, see ?SJ
     id b
[1,] 11 a
[2,] 12 b
[3,] 12 b
[4,] 14 d
[5,] 14 d

identical(Y2[J(X2$y_id)], Y2[X2$y_id])
[1] FALSE

thanks. but `identical(Y2[J(X2$y_id)], Y2[X2$y_id]) == TRUE`, is there a difference? — David LeBauer, Mar 29 '12 at 18:08
@David Have changed the example data to make it clearer. The previous example data had key values the same as the row numbers i.e. `1:5`. — Matt Dowle, Mar 29 '12 at 21:15

Translating SQL joins on foreign keys to R data.table syntax

1 Answers1

Linked