20

The data.table package provides many of the same table handling methods as SQL. If a table has a key, that key consists of one or more columns. But a table can't have more than one key, because it can't be sorted in two different ways at the same time.

In this example, X and Y are data.tables with a single key column "id"; Y also has a non-key column "x_id".

   X <- data.table(id = 1:5, a=4:8,key="id")
   Y <- data.table(id = c(1,1, 3,5,7), x_id=c(1,4:1), key="id")

The following syntax would join the tables on their keys:

  X[Y]

How can I translate the following SQL syntax to data.table code?

  select * from X join Y on X.id = Y.x_id; 

The closest that I have gotten is:

Y[X,list(id, x_id),by = x_id,nomatch=0]

However, this does not do the same inner join as the SQL statement.


Here is a more clear example in which the foreign key is y_id, and we want the join to look up values of Y2 where X2$y_id = Y2$id.

    X2 <- data.table(id = 1:5, y_id = c(1,1,2,2,2), key="id")
    Y2 <- data.table(id = 1:5, b = letters[1:5], key="id")

I would like to produce the table:

   id  y_id  b
    1     1 "a"
    2     1 "a"
    3     2 "b"
    4     2 "b"
    5     2 "b"

similar to what is done by the following kludge:

> merge(data.frame(X2), data.frame(Y2), by.x = "y_id", by.y = "id")
  y_id id b
1    1  1 a
2    1  2 a
3    2  3 b
4    2  4 b
5    2  5 b

However, when I do this:

    X2[Y2, 1:2,by = y_id]

I do not get the desired result:

    y_id V1
[1,]    1  1
[2,]    1  2
[3,]    2  1
[4,]    2  2
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
David LeBauer
  • 31,011
  • 31
  • 115
  • 189
  • 2
    You can temporarily change `X2` keys and set them to `"y_id"`; then perform a normal join `Y2[X2]` (or `X2[Y2]` depending on the direction) and then restore the previous key of `X2`. – digEmAll Mar 28 '12 at 20:56
  • @digEmAll that is useful, I thought that was what the `by` argument does... but can this be applied to a multiple (>2) table join where the target (left) table has >1 foreign key? – David LeBauer Mar 28 '12 at 21:07
  • I agree with @digEmAll: `setkey(X2, y_id)` followed by `X2[Y2, nomatch=0]` is all you need for your example. And this should work with several keys as well. However, I'm not really familiar with the foreign key syntax in SQL, so if you are struggling with more keys, could you extend your example? – Christoph_J Mar 29 '12 at 07:34
  • "The following syntax would join the tables on their keys: `X[Y]` -- I think the equivalent SQL syntax would be: `SELECT * FROM A NATURAL JOIN B;` – onedaywhen Mar 29 '12 at 09:29
  • I edited the top of the question to clear up terminology. Hope ok. Will answer ... – Matt Dowle Mar 29 '12 at 10:14
  • @MatthewDowle thank you for your help clarifying the terminology - I was getting stuck trying to translate primary/foreign keys to the keys used in `data.table`. Now if I understand correctly, the 'key' is to set the key for each table as required for a specific join. – David LeBauer Mar 29 '12 at 14:37
  • @David No problem. Kind of, but you don't have to set the key of `i` at all (see first example in my answer showing an unkeyed `i`). Secondary keys are on the wish list which should help. People are doing "manual secondary keys" currently; search datatable-help for several threads on that. Also, perhaps `merge.data.table` should gain `by.x` and `by.y` arguments, but anything other than a join to `x`'s key will always be slower. For that reason some people hold two copies of `x` in RAM, with different keys. It depends on the data and task. See FAQs 3.1 - 3.3. – Matt Dowle Mar 29 '12 at 15:55

1 Answers1

18

Good question. Note the following (admittedly buried) in ?data.table :

When i is a data.table, x must have a key. i is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge of the two tables is carried out.

So, the key here is that i doesn't have to be keyed. Only x must be keyed.

X2 <- data.table(id = 11:15, y_id = c(14,14,11,12,12), key="id")
     id y_id
[1,] 11   14
[2,] 12   14
[3,] 13   11
[4,] 14   12
[5,] 15   12
Y2 <- data.table(id = 11:15, b = letters[1:5], key="id")
     id b
[1,] 11 a
[2,] 12 b
[3,] 13 c
[4,] 14 d
[5,] 15 e
Y2[J(X2$y_id)]  # binary search for each item of (unsorted and unkeyed) i
     id b
[1,] 14 d
[2,] 14 d
[3,] 11 a
[4,] 12 b
[5,] 12 b

or,

Y2[SJ(X2$y_id)]  # binary merge of keyed i, see ?SJ
     id b
[1,] 11 a
[2,] 12 b
[3,] 12 b
[4,] 14 d
[5,] 14 d

identical(Y2[J(X2$y_id)], Y2[X2$y_id])
[1] FALSE
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224