5

In R 2.15.0 and data.table 1.8.9:

d = data.table(a = 1:5, value = 2:6, key = "a")

d[J(3), value]
#   a value
#   3     4

d[J(3)][, value]
#   4

I expected both to produce the same output (the 2nd one) and I believe they should.

In the interest of clearing up that this is not a J syntax issue, same expectation applies to the following (identical to the above) expressions:

t = data.table(a = 3, key = "a")
d[t, value]
d[t][, value]

I would expect both of the above to return the exact same output.

So let me rephrase the question - why is (data.table designed so that) the key column printed out automatically in d[t, value]?

Update (based on answers and comments below): Thanks @Arun et al., I understand the design-why now. The reason the above prints the key is because there is a hidden by present every time you do a data.table merge via the X[Y] syntax, and that by is by the key. The reason it's designed this way seems to be the following - since the by operation has to be performed when merging, one might as well take advantage of that and not do another by if you are going to do that by the key of the merge.

Now that said, I believe that's a syntax design flaw. The way I read data.table syntax d[i, j, by = b] is

take d, apply the i operation (be that subsetting or merging or whatnot), and then do the j expression "by" b

The by-without-by breaks this reading and introduces cases one has to think about specifically (am I merging on i, is by just the key of the merge, etc). I believe this should be the job of the data.table - the commendable effort to make data.table faster in one particular case of the merge, when the by is equal to the key, should be done in an alternative way (e.g. by checking internally if the by expression is actually the key of the merge).
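To make the hidden by concrete, here is a sketch using the explicit spelling that data.table later introduced (assumes data.table >= 1.9.4, newer than the 1.8.9 above, where `by = .EACHI` writes out the formerly implicit grouping):

```r
library(data.table)  # assumes data.table >= 1.9.4, where by = .EACHI exists

d <- data.table(a = 1:5, value = 2:6, key = "a")
t <- data.table(a = 3, key = "a")

# The hidden by, written out: group j by each row of the joining table,
# which is why the key column a appears in the output.
d[t, value, by = .EACHI]
#    a value
# 1: 3     4

# My expected reading: apply i, then evaluate j on the result.
d[t][, value]
# [1] 4
```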

eddi
  • what it does now; also fwiw, `d[J(3), value := 10]` works as expected – eddi Apr 18 '13 at 21:21
  • huh? I don't think we understand each other. I think (in this case) `d[3, value]` and `d[J(3), value]` should produce the same result. – eddi Apr 18 '13 at 21:24
  • 5
    can we please change the title of this question. The behavior is 100% expected and often leveraged. – Ricardo Saporta Apr 18 '13 at 21:58
  • 2
    @Arun imagine I write a function `fancy_sum(x, y)`, that would compute the sum of `x` and `y` normally, except when `x` is equal to 10, in which case it would multiply. Imagine this is also documented behavior, called *product-instead-of-sum*. While a documented *syntax design choice*, I would call that a clear *syntax design flaw*. – eddi Apr 19 '13 at 12:36
  • @Arun, again, please see https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2696&group_id=240&atid=978 – eddi Apr 19 '13 at 16:32
  • 1
    @Arun, read the comment at the bottom of the FR. If that still doesn't get the point across, I don't know what would. You're talking about how things work, and why supplying different classes of inputs produces different results, while I'm talking about how things *should* work, and how different classes of inputs producing different results for this breaks user expectations. – eddi Apr 19 '13 at 16:43
  • This is apparently being changed soon. See @Arun's comment here http://stackoverflow.com/a/20914724/636656 – Ari B. Friedman Jan 10 '14 at 16:10

4 Answers

11

Edit number Infinity: FAQ 1.12 answers your question exactly. (FAQ 1.13 is also useful/relevant, but not pasted here.)

1.12 What is the difference between X[Y] and merge(X,Y)?
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index. Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index. merge(X,Y)1 does both ways at the same time. The number of rows of X[Y] and Y[X] usually dier; whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards?
You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will subset only those columns; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge followed by a subset?
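A runnable sketch of the FAQ's point (the column names foo/bar follow the FAQ's hypothetical; the data are made up). Note that in the 1.8.x versions discussed in this question, the join form would additionally group j by each row of Y (the very by-without-by at issue); in data.table >= 1.9.4 both forms below return a single value:

```r
library(data.table)  # assumes data.table >= 1.9.4 (no implicit by-without-by)

# foo/bar follow the FAQ's hypothetical column names; the data are made up
X <- data.table(id = 1:3, foo = c(2, 4, 6), key = "id")
Y <- data.table(id = c(1, 3), bar = c(10, 100), key = "id")

# One-step join + computation: only the columns j uses are touched
X[Y, sum(foo * bar)]
# [1] 620

# The merge-then-compute equivalent materialises all joined columns first
merge(X, Y)[, sum(foo * bar)]
# [1] 620
```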


Old answer, which (according to the OP's comment) did nothing to answer the OP's question, retained here because I believe it does.

When you give a value for j like d[, 4] or d[, value] in data.table, the j is evaluated as an expression. From the data.table FAQ 1.1 on accessing DT[, 5] (the very first FAQ) :

Because, by default, unlike a data.frame, the 2nd argument is an expression which is evaluated within the scope of DT. 5 evaluates to 5.

The first thing, therefore, to understand is, in your case:

d[, value] # produces a "vector"
# [1] 2 3 4 5 6

This is not different when the query for i is a basic indexing like:

d[3, value] # produces a vector of length 1
# [1] 4

However, this is different when i is by itself a data.table. From data.table introduction (page 6):

d[J(3)] # is equivalent to d[data.table(a = 3)]

Here, you are performing a join. If you just do d[J(3)] then you'd get all columns corresponding to that join. If you do,

d[J(3), value] # which is equivalent to d[J(3), list(value)]

Since you say this answer does nothing to answer your question, I'll point to where the answer to your "rephrased" question, I believe, lies: ---> then you'd get just that column, but since you're performing a join, the key column will also be output (as it's a join between two tables based on the key column).


Edit: Following your 2nd edit: if your question is *why so?*, then I'd reluctantly (or rather ignorantly) answer that Matthew Dowle designed it so, to differentiate between a data.table join-based subset and an index-based subsetting operation.

Your second syntax is equivalent to:

d[J(3)][, value] # is equivalent to:

dd <- d[J(3)]
dd[, value]

where, again, in dd[, value], j is evaluated as an expression and therefore you get a vector.


To answer your 3rd modified question: for the 3rd time, it's because it is a JOIN between two data.tables based on the key column. If I join two data.tables, I'd expect a data.table.

From data.table introduction, once again:

Passing a data.table into a data.table subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column matrix. In fact, the A[B] syntax in base R inspired the data.table package.
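That base R analogue can be seen directly (pure base R, no data.table needed): the rows of B are (row, column) pairs used to look up elements of A:

```r
# Pure base R: indexing a matrix A with a two-column matrix B of
# (row, column) pairs returns the elements at those positions.
A <- matrix(1:9, nrow = 3)
B <- matrix(c(1, 1,    # picks A[1, 1]
              3, 2),   # picks A[3, 2]
            ncol = 2, byrow = TRUE)
A[B]
# [1] 1 6
```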

Arun
  • Your comments helped me pinpoint where exactly my expectation breaks, so see OP edit. As is, this answer does nothing for my question. – eddi Apr 18 '13 at 22:02
  • Feel free to down-vote then. If anyone else who understands your question *and* my answer tells me that this answer *does nothing* to help your question, then I'd be happy to delete it. – Arun Apr 18 '13 at 22:06
  • @eddi I think this actually answers your question rather precisely. – joran Apr 18 '13 at 22:08
  • 1
    my question is **why** is key printed out, not the explanation of **how** it's printed out! – eddi Apr 18 '13 at 22:09
  • as in the design-why, not the how-why, I really don't know how else to state it :) – eddi Apr 18 '13 at 22:11
  • @Arun, yes `A[B]` returns a `data.table`, so does `A`, does that mean that when you type `A[, value]` you expect a `data.table`? No. In any other usage of `i` except to join, you **don't** get a `data.table` when you write `A[i-expression, value]`. This is why `A[B, value]` breaks my expectation and I don't think you explained that design decision in your edits yet. – eddi Apr 18 '13 at 22:24
  • @eddi, when you do `merge` on two data.frames, do you expect the common columns to be returned? – Arun Apr 18 '13 at 22:25
  • yes, same as when I do `A[B]`. However `merge` doesn't have the `j` syntax for me to demonstrate `A[B, value]`, so I don't carry over any expectations from there. – eddi Apr 18 '13 at 22:26
  • `A[, value]` is *not* a `data.table` join. Why would I expect a `data.table`? `A[DT, value]` **is a join (merge) of two data.tables**, why wouldn't I expect a data.table? It's like asking, I merge two `data.frame`s and I don't get a vector... – Arun Apr 18 '13 at 22:26
  • Because for any other **i-expression**, `A[i, value]` is **not** a `data.table`. – eddi Apr 18 '13 at 22:28
  • So why do you expect a `data.table` when `i` is NOT a data.table as it is NOT a join/merge of two data.tables? – Arun Apr 18 '13 at 22:28
  • huh, wat? I don't expect a `data.table` from **any** `A[i, value]`, yet I get it when `i` is a `data.table`. – eddi Apr 18 '13 at 22:29
  • When `i` is a `data.table`, it's equivalent to joining two `data.tables`. Do you agree/understand? – Arun Apr 18 '13 at 22:30
  • If so, why wouldn't you expect the output to be a data.table? Why are you surprised that the join of two data.tables is a data.table? – Arun Apr 18 '13 at 22:31
  • :) this is an infinite loop, see 6 posts above – eddi Apr 18 '13 at 22:32
  • Here's how I read `d[i,j]` - take `d`, subset or join or do some other magic involving `i`, then do `j` for the result – eddi Apr 18 '13 at 22:33
  • X[Y] **is** a merge (also implemented by `merge.data.table` method). As such, I expect the intersecting column. – Arun Apr 18 '13 at 22:34
  • in `X[Y]`, I agree wholeheartedly, in `X[Y, value]`, emphatically no, for reasons two comments above – eddi Apr 18 '13 at 22:35
  • How about: `X[Y, list(value)]`? – Arun Apr 18 '13 at 22:35
  • Same logic, the output should be the same as `X[Y][, list(value)]` - take `X`, do `Y`, return `list(value)` – eddi Apr 18 '13 at 22:36
  • One last question: when you do `d[, list(.), by = .]`, why do you expect the columns in `by` to be in the result as well? Why not just the `list(.)` entries?? – Arun Apr 18 '13 at 22:39
  • if you want expectations - because that's what `by` does, it doesn't break my `d[i,j]` expectation, because there is an extra `by`; as a design decision it's pretty obvious too, as most any usage of `by` requires having the `by` column at the end. This is not the case for `X[Y, j]` - at least I'm having a very hard time thinking of a usage case where I'd want to have the key there if I didn't request it explicitly in `j`. – eddi Apr 18 '13 at 23:12
  • Wow... I am now really interested in hearing what @MatthewDowle (sorry to drag you into this mess) has to say! It *clearly* answers your question! `Why merge all the columns of data, only to use a small subset of them afterwards?` – Arun Apr 18 '13 at 23:15
  • Finally getting somewhere (@Arun, I do appreciate the effort btw, even if we do/did speak past each other a lot) - from FAQ 1.13 - it seems like it's there because of the *by without by* or *grouping by i* feature, which at the moment doesn't make much sense to me, but explanations are welcome :) – eddi Apr 18 '13 at 23:20
  • @Arun, this is not about what to merge or how, but what to output at the end, you're again going somewhere else :) – eddi Apr 18 '13 at 23:22
  • and 1.14 cements the point of 1.13; I'm still confused about it, maybe tomorrow it will become more clear – eddi Apr 18 '13 at 23:24
  • Hmm, seems I've failed to understand every step of the way. I've tagged MatthewDowle. I'll leave it to the author of the package to take it from here. I'm of course interested to figure out what I (and arguably some of the `data.table` users here on SO) seem to have missed. – Arun Apr 18 '13 at 23:25
  • @Arun, while I think my question has been answered thanks to this discussion (and has been reduced to me wanting to submit a feature change request and not knowing how to do that or if that would even have any effect), I don't think your answer specifically answers it; so +1 and thanks for the effort and I consider the OP closed, but I don't think it makes sense to give your answer the checkmark – eddi Apr 19 '13 at 01:11
  • @eddi, shall we cleanup this comment section as well (before to delete the answer)? – Arun Mar 13 '14 at 16:23
  • @Arun I think this answer helped me out quite a bit in understanding the problem, and has a well-deserved +1 from me, so I'd vote against deleting it :) – eddi Mar 13 '14 at 16:28
6

As of data.table 1.9.3, the default behavior has been changed, and the two expressions from the question now produce the same result. To get the by-without-by result, one now has to specify an explicit by = .EACHI:

d = data.table(a = 1:5, value = 2:6, key = "a")

d[J(3), value]
#[1] 4

d[J(3), value, by = .EACHI]
#   a value
#1: 3     4

And here's a slightly more complicated example, illustrating the difference:

d = data.table(a = 1:2, b = 1:6, key = 'a')
#   a b
#1: 1 1
#2: 1 3
#3: 1 5
#4: 2 2
#5: 2 4
#6: 2 6

# normal join
d[J(c(1,2)), sum(b)]
#[1] 21

# join with a by-without-by, or by-each-i
d[J(c(1,2)), sum(b), by = .EACHI]
#   a V1
#1: 1  9
#2: 2 12

# and a more complicated example:
d[J(c(1,2,1)), sum(b), by = .EACHI]
#   a V1
#1: 1  9
#2: 2 12
#3: 1  9
eddi
  • Since it's very likely that this post'll be *the* place to be looked at for this change, it'd make sense answer in detail. I'll also try to edit when I've some time. – Arun Mar 14 '14 at 15:12
  • @Arun I added a more complicated example, obviously feel free to edit – eddi Mar 14 '14 at 15:19
  • 1
    @Arun This change has been a major disrupt for me. I agree that from a new user perspective this might be more intuitive, but, wouldn't it be better if by = .EACHI is kept as the default for backwards compatibility? At least, it would be very important to give a big heads up to all the other packages that depend on data.table – Juancentro Jun 04 '14 at 21:41
4

This is not unexpected behaviour; it is documented behaviour. Arun has done a good job of explaining and demonstrating in the FAQ where this is clearly documented.

There is a feature request, FR 1757, that proposes using a drop argument in this case.

When implemented, the behaviour you want might be coded as:

d = data.table(a = 1:5, value = 2:6, key = "a")

d[J(3), value, drop = TRUE]
mnel
  • thanks, I understand the design why now (which is, if you support by-without-by, then that's a hidden `by`, and so the `by` column should be there), and so far the conclusion of all this (for me) seems to be that by-without-by is a bad syntax feature to have (as opposed to manually specifying a `by` if you want to and default behavior of `X[Y,j]` being the same as `X[Y][,j]`). I'm very open to be convinced otherwise by some examples. – eddi Apr 19 '13 at 00:39
  • 1
    `by-without-by` is **significantly** faster than `by` or even a keyed `by`. I think an extra couple of key strokes is worth it. If you want to have a long drawn out discussion, move to the data.table mailing list. If you want to comment on the feature request for drop = TRUE, do so. – mnel Apr 19 '13 at 00:48
  • then that feature should come as an extra option, but not at a cost to normal syntax and understanding; anyway, my original question is answered I guess; fwiw another inconsistency in this by-without-by business is that the hidden `by` column is actually unavailable in the `j` expression, so that `d[J(3), a]` fails – eddi Apr 19 '13 at 01:06
  • 2
    by-without-by **is** normal data.table syntax . You could put in a feature request to remove by-without-by, but I think the problem is your understanding and expectations, which (clearly) are out of step with the data.table author (and the majority of users). `d[.(3), a]` is an interesting case, I'm not sure how that gets the result it does. If you name within `i` it (can) work `d[.(a=3), a]`. Perhaps this is worth a question by itself (on the mailing list perhaps) – mnel Apr 19 '13 at 01:18
  • @eddi -- I have created [FR 2693](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2693&group_id=240&atid=978) to automatically use the key column names when an unnamed list is passed to i – mnel Apr 19 '13 at 02:04
  • 1
    @mnel +1 for FR#2693. Will try and implement that. – Matt Dowle Apr 19 '13 at 11:01
3

I agree with Arun's answer. Here's another wording: After you do a join, you often will use the join column as a reference or as an input to further transformation. So you keep it, and you have an option to discard it with the (more roundabout) double [ syntax. From a design perspective, it is easier to keep frequently relevant information and then discard when desired, than to discard early and risk losing data that is difficult to reconstruct.

Another reason that you'd want to keep the join column is that you can perform aggregate operations at the same time as you perform a join (the by without by). For example, the results here are much clearer by including the join column:

d <- data.table(a=rep.int(1:3,2),value=2:7,other=100:105,key="a")
d[J(1:3),mean(value)]
#   a  V1
#1: 1 3.5
#2: 2 4.5
#3: 3 5.5
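For comparison, both behaviours can be spelled out by chaining on the same d: an explicit by after the join reproduces the grouped result above, while chaining without by discards the join column (a sketch; assumes a recent data.table):

```r
library(data.table)

d <- data.table(a = rep.int(1:3, 2), value = 2:7, other = 100:105, key = "a")

# Explicit grouping after the join reproduces the by-without-by result:
d[J(1:3)][, mean(value), by = a]
#    a  V1
# 1: 1 3.5
# 2: 2 4.5
# 3: 3 5.5

# Chaining without by discards the join column (one overall mean):
d[J(1:3)][, mean(value)]
# [1] 4.5
```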
Blue Magister
  • I'm going to have think about this, but atm I don't quite understand the need for `d[J(1:3), mean(value)]` syntax for getting that result vs what I would consider much more palatable option of using `d[J(1:3), mean(value), by = a]` syntax (I actually haven't checked if this works, but I would *want* it to work this way). I guess I don't understand the need (and want) for the *by without by* syntax, as in my case all it does is just confuse with very little apparent benefit. – eddi Apr 19 '13 at 00:31
  • 1
    As @Simon101 quotes Prof. Ripley, "*Your preferences are irrelevant. It's the documentation that counts*" (and that *does not* fail to give explainations). That said, MatthewDowle is very nice and receptive and he may play along. I'm perfectly happy the way it is. – Arun Apr 19 '13 at 06:54
  • 1
    @eddi The motivation for _by-without-by_ is speed when you have a _subset_ of known groups. Say you have 1000 groups (each with 1000 rows and 100 columns), but you only want the mean of one column for only 6 groups. A subset of all columns for those 6 groups, followed by a `by` would be slower than a direct _by-without-by_ with fancy `j` column usage inspection. See FAQ 1.12 for more info on the thinking here. – Matt Dowle Apr 19 '13 at 11:10
  • 1
    @MatthewDowle I have nothing against the *concept* of by without by, my complaint is against its *syntax*. I believe the syntax should be "normal" by, and all of the magic should happen internally. – eddi Apr 19 '13 at 12:41
  • @eddi I see where you're coming from now, I think. In my case what I do a lot is `DT[myids,mean(stuff)]` or `DT[myids,list(mean(stuff),sd(stuff),...)]`. I like `data.table` giving me the aggregates for each `id` automatically, because that's what I need most of the time. I think I'd find having to set `by=` as well too onerous. – Matt Dowle Apr 19 '13 at 14:02
  • @eddi Also, _by-without-by_ only happens when `mult='all'`. I sometimes set `mult='first'`, or use `DT[myids][,mean(stuff)]` as the FAQ says. There is also a feature request in this area somewhere to be done. – Matt Dowle Apr 19 '13 at 14:12
  • @MatthewDowle, I created a feature request to change this behavior https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2696&group_id=240&atid=978 , I tried to explain there, as clearly as I could, why it broke my expectations and why I think current behavior is more complicated than it should be – eddi Apr 19 '13 at 15:30
  • @eddi, I'm a bit confused. How do you suggest to implement the concept of *by-**without**-by* using *normal by*? I guess you mean by `normal by` that, it should be of the form: `d[i, j, by=.]`. How will this be *by-**without**-by*? Could you elaborate? To me, the concept of `by-without-by` shouldn't involve `by` (as the name suggests). Since you have nothing against the concept, this confuses me. – Arun Apr 19 '13 at 16:16
  • @Arun, what I mean is that by-without-by is there because of performance improvements when you merge and then do a by by the key. That concept of doing both at the same time is a great one and will save time. But the syntax to achieve that behavior should not compromise the understanding of what `d[i,j,by=b]` means. – eddi Apr 19 '13 at 16:19
  • So basically you don't want the concept of `by-without-by`... Clearly your suggestion is *not* `by-without-by`. – Arun Apr 19 '13 at 16:21
  • 1
    I want the concept of speeding up when you by by the key, but not the current syntax. I don't want silent by's. – eddi Apr 19 '13 at 16:21
  • @Arun please see the feature request, I tried to be more clear there than I can be in this comment-space. – eddi Apr 19 '13 at 16:28
  • @eddi An `i` join with `mult='all'` is like `CROSS APPLY` in SQL, if that helps? If the same group appears in `i` twice then the same aggregate will be returned for both, in the same order as `i` (useful sometimes). Not sure how an `i` then `by` would work in this case (it may consider the two `i` dups as one?) – Matt Dowle Apr 19 '13 at 17:40
  • @eddi How about an option to turn off `by-without-by`? In the usual way it would be a new argument to `[.data.table` with default taken from the global option. And that would simplify the answer to FAQ 1.14 nicely, too. – Matt Dowle Apr 19 '13 at 17:49
  • @MatthewDowle I don't really know SQL enough to draw an analogy. If you're going to go the option route, I think, ideally, turning *on* `by-without-by` syntax should be optional and not *off*. I get that this is version 1.8.9 and the change could be disruptive and that's one of your concerns. If introducing the option first, and then flipping it around a few versions later is the smoother way of doing it - that's cool, but I do believe that the end state should be that the default reading of `d[i,j,by=b]` is `take d, apply i, return j by b`, or `take d, apply i, return j` if there is no `by`. – eddi Apr 19 '13 at 18:32
  • @eddi Im happy to make it optional, since there's no downside really. Changing the default would need more people (on datatable-help) to agree with you. A full discussion there would be useful. – Matt Dowle Apr 19 '13 at 19:32
  • @MatthewDowle ok, I submitted a post to that mailing-list – eddi Apr 19 '13 at 19:54