
UPDATE: Old question ... it was resolved by data.table v1.5.3 in Feb 2011.

I am trying to use the data.table package, and really like the speedups I am getting, but I am stumped by this error when I do x[y, <expr>] where x and y are "data-tables" with the same key, and <expr> contains column names of both x and y:

require(data.table)
x <- data.table( foo = 1:5, a = 5:1 )
y <- data.table( foo = 1:5, boo = 10:14)
setkey(x, foo)
setkey(y, foo)
> x[y, foo*boo]
Error in eval(expr, envir, enclos) : object 'boo' not found

UPDATE... To clarify the functionality I am looking for in the above example: I need to do the equivalent of the following:

with(merge(x,y), foo*boo)

However according to the below extract from the data.table FAQ, this should have worked:

Finally, although it appears as though x[y] does not return the columns in y, you can actually use the columns from y in the j expression. This is what we mean by join inherited scope. Why not just return the union of all the columns from x and y and then run expressions on that? It boils down to efficiency of code and what is quicker to program. When you write x[y,foo*boo], data.table automatically inspects the j expression to see which columns it uses. It will only subset, or group, those columns only. Memory is only created for the columns the j uses. Let's say foo is in x, and boo is in y (along with 20 other columns in y). Isn't x[y,foo*boo] quicker to program and quicker to run than a merge step followed by another subset step?

I am aware of this question that addressed a similar issue, but it did not seem to have been resolved satisfactorily. Anyone know what I am missing or misunderstanding? Thanks.

UPDATE: I asked on the data-table help mailing list and the package author (Matthew Dowle) replied that indeed the FAQ quoted above is wrong, so the syntax I am using will not work currently, i.e. I cannot refer to the y columns in the j (i.e. second) argument when I do x[y,...].
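
For readers on a later release, here is a minimal sketch of the one-step form the question asks for. It assumes a recent data.table installation in which the j expression can see the columns of the table passed as i (the first line of this post says the behaviour was fixed in v1.5.3). The names `joined` and `reference` are introduced here only for illustration.

```r
library(data.table)

# Same toy tables as in the question.
x <- data.table(foo = 1:5, a = 5:1)
y <- data.table(foo = 1:5, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

# One step: join x to y on the shared key and evaluate j with columns of both.
joined <- x[y, foo * boo]

# Two steps, for comparison: merge first, then compute on the merged table.
reference <- with(merge(x, y), foo * boo)

# The two approaches should agree.
stopifnot(isTRUE(all.equal(joined, reference)))
```

In recent versions a column of the i table can also be referred to explicitly with the `i.` prefix (for example `i.boo`), which helps when both tables share a non-key column name.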

Prasad Chalasani
  • But you asked some time ago and it was addressed by v1.5.3 released to CRAN in Feb 2011. Please see its NEWS, new ?data.table and corrected FAQ. – Matt Dowle Mar 24 '11 at 12:34
  • @Matthew thank you, yes I know it's been addressed by the latest release, and I'm glad you pointed it out here so it's clear to others. – Prasad Chalasani Mar 24 '11 at 14:06

1 Answer


I am not sure I fully understand the problem, and I have only just started reading the data.table docs, but if you would like to get the columns of y and also combine them with the a column of x, you might try something like:

> x[y,a*y]
     foo boo
[1,]   5  50
[2,]   8  44
[3,]   9  36
[4,]   8  26
[5,]   5  14

Here, you get back the columns of y multiplied by the a column of x. If you want to get x's foo multiplied by y's boo, try:

> y[,x*boo]
     foo  a
[1,]  10 50
[2,]  22 44
[3,]  36 36
[4,]  52 26
[5,]  70 14

After editing: thank you @Prasad Chalasani for making the question clearer for me.

If a simple merge is preferred, then the following should work. I made up a slightly more complex data set to examine the behaviour more closely:

x <- data.table( foo = 1:5, a=20:24, zoo = 5:1 )
y <- data.table( foo = 1:5, b=30:34, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

So only one extra column was added to each data.table. Let us compare merge with doing the same directly with data.tables:

> system.time(merge(x,y))
   user  system elapsed 
  0.027   0.000   0.023 
> system.time(x[,list(y,x)])
   user  system elapsed 
  0.003   0.000   0.006 

The latter looks a lot faster. The results are not identical, but they can be used in the same way (the latter has an extra column, foo.1):

> merge(x,y)
     foo  a zoo  b boo
[1,]   1 20   5 30  10
[2,]   2 21   4 31  11
[3,]   3 22   3 32  12
[4,]   4 23   2 33  13
[5,]   5 24   1 34  14
> x[,list(x,y)]
     foo  a zoo foo.1  b boo
[1,]   1 20   5     1 30  10
[2,]   2 21   4     2 31  11
[3,]   3 22   3     3 32  12
[4,]   4 23   2     4 33  13
[5,]   5 24   1     5 34  14

So to get xy we might use: xy <- x[,list(x,y)]. To compute a one-column data.table from xy$foo * xy$boo, the following might work:

> xy[,foo*boo]
[1] 10 22 36 52 70

Well, the result is not a data.table but a vector instead.
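
If a one-column data.table is wanted rather than a vector, wrapping the j expression in list() should do it. This is a minimal sketch, assuming xy is built with merge(x, y) rather than x[,list(x,y)]; the column name `foo_boo` and the variable `res` are made up for illustration.

```r
library(data.table)

# Same tables as in the example above.
x <- data.table(foo = 1:5, a = 20:24, zoo = 5:1)
y <- data.table(foo = 1:5, b = 30:34, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

# Assumption: build xy by merging on the shared key.
xy <- merge(x, y)

# A bare expression in j returns a vector; list(...) in j returns a data.table.
res <- xy[, list(foo_boo = foo * boo)]

stopifnot(is.data.table(res), ncol(res) == 1L)
```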


Update (29/03/2012): thanks to @David for pointing out that merge.data.table was used in the above examples.
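
To make that distinction concrete: merge() is a generic, so calling it on data.tables dispatches to merge.data.table, while converting to plain data frames first goes through the base method. A rough sketch, where the variable names `m_dt` and `m_df` are made up for illustration:

```r
library(data.table)

x <- data.table(foo = 1:5, a = 20:24, zoo = 5:1)
y <- data.table(foo = 1:5, b = 30:34, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

# Called on data.tables, merge() dispatches to merge.data.table.
m_dt <- merge(x, y)

# Converting to data.frame first makes merge() use the base data.frame method.
m_df <- merge(as.data.frame(x), as.data.frame(y), by = "foo")

stopifnot(is.data.table(m_dt), !is.data.table(m_df))
```

So a timing comparison against base merge would need the data.frame version on the merge side.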

daroczig
  • Referring to the example in my question, I want to do a join of `x` and `y`, let's call it `xy`, and then create a single-column data-frame that is equal to `xy$foo * xy$boo`. – Prasad Chalasani Jan 22 '11 at 00:48
  • @Prasad Chalasani: I edited my answer, I hope you can find something new and valuable in it. – daroczig Jan 22 '11 at 09:02
  • thanks for the details, but my question was about why the specific syntax I describe in my question is not working, contrary to what it says in the FAQ. I know that I can do it in two stages (merge, then operate on columns), but I want the `x[y, ]` syntax to work *in one step* -- i.e. do the join and operate on the `x` and `y` columns in one step. This is syntactically less tedious, and possibly faster (if implemented right internally). I am dealing with 10-million row data-frames, so I'm not concerned with the timings of the small toy example above. – Prasad Chalasani Jan 22 '11 at 15:21
  • @Prasad Chalasani: I see, then my "answer" seems to be no answer :( I suppose the FAQ is just not correct at that part, as @f3lix suggested. – daroczig Jan 22 '11 at 15:59
  • This was resolved by v1.5.3 released to CRAN in Feb 2011. Please see its NEWS, new ?data.table and corrected FAQ. – Matt Dowle Mar 24 '11 at 12:35
  • isn't your example using `merge.data.table` and not `base::merge`? – David LeBauer Mar 29 '12 at 14:26
  • Thanks @David, I think *now* you are right. Will update that answer. – daroczig Mar 29 '12 at 21:18