22

I thought I should ask this as a follow-up to this question. I'd like to confirm whether this is a bug/inconsistency before filing it as such in the R-forge tracker.

Consider this data.table:

require(data.table)
DT <- data.table(x=c(1,0,NA), y=1:3)

Now, to select all rows of DT where x is not 0, we could do it in either of these ways:

DT[x != 0]
#    x y
# 1: 1 1
DT[!(x == 0)]
#     x y
# 1:  1 1
# 2: NA 3

DT[x != 0] and DT[!(x==0)] give different results even though the underlying logical operations are equivalent.

Note: Converting this into a data.frame and running these operations will give results that are identical with each other for both logically equivalent operations, but that result is different from both these data.table results. For an explanation of why, look at ?`[` under the section NAs in indexing.

Edit: Since some of you have asked for consistency with data.frame, here's the output of the same operations on a data.frame:

DF <- as.data.frame(DT)
# check ?`[` under the section `NAs in indexing` as to why this happens
DF[DF$x != 0, ]
#     x  y
# 1   1  1
# NA NA NA
DF[!(DF$x == 0), ]
#     x  y
# 1   1  1
# NA NA NA

I think this is an inconsistency and both should provide the same result. But, which result? The documentation for [.data.table says:

i ---> Integer, logical or character vector, expression of column names, list or data.table.

integer and logical vectors work the same way they do in [.data.frame. Other than NAs in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame.
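The two base R behaviours that this passage contrasts against can be seen on a plain vector. This is only a minimal sketch; the names `vals` and `idx` are illustrative and not part of the original example:

```r
vals <- c(10, 20, 30)
idx  <- c(TRUE, FALSE, NA)

# An NA in a logical index yields an NA element instead of dropping the position
vals[idx]   # 10 NA

# A single logical NA is recycled to the full length of the vector
vals[NA]    # NA NA NA
```

[.data.frame follows the same rules for its row index, which is where the all-NA row in the DF output comes from.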

It's clear why the results are different from what one would get from doing the same operation on a data.frame. But still, within data.table, if this is the case, then both of them should return:

#    x y
# 1: 1 1

I went through the [.data.table source code and now understand why this is happening. See this post for a detailed explanation.

Briefly, x != 0 evaluates to a logical vector and its NA gets replaced with FALSE. For !(x==0), however, (x == 0) is first evaluated to a logical vector and its NA gets replaced with FALSE. Only then does the negation happen, so the NA effectively becomes TRUE.
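The order of operations can be imitated in base R. This is a sketch of the steps as described, not data.table's actual code; `keep1` and `keep2` are names made up for the illustration:

```r
x <- c(1, 0, NA)

# Case 1: x != 0 is evaluated as a whole, then NA is replaced with FALSE
keep1 <- x != 0                 # TRUE FALSE NA
keep1[is.na(keep1)] <- FALSE
keep1                           # TRUE FALSE FALSE

# Case 2: x == 0 is evaluated, NA is replaced with FALSE, and only then negated
keep2 <- x == 0                 # FALSE TRUE NA
keep2[is.na(keep2)] <- FALSE
keep2 <- !keep2
keep2                           # TRUE FALSE TRUE
```

The only difference between the two cases is whether the NA replacement happens before or after the negation.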

So, my first (or rather main) question is: is this a bug/inconsistency? If so, I'll file it as one in the data.table R-forge tracker. If not, I'd like to know the reason for this difference, and I would like to suggest a correction to the documentation explaining it (to the already amazing documentation!).

Edit: Following up on the comments, the second question is: should data.table's handling of subsetting by columns containing NA resemble that of data.frame? (But I agree with @Roland's comment that this may very well lead to opinions, and I'm perfectly fine with this question not being answered at all.)

Arun
  • 1
    (+1) Interesting. I expected `DT` to behave same as `DF`! – Nishanth Apr 26 '13 at 14:59
  • 5
    My vote is for a bug, because I'd like `data.table` objects to behave exactly the way `data.frame` objects do. – Carl Witthoft Apr 26 '13 at 15:01
  • 2
    This question seems to ask for voting/opinions a bit too much for my taste. – Roland Apr 26 '13 at 15:40
  • @Roland, the results are different for an equivalent operation. Is it a bug or not? How is this an opinion? I don't follow. – Arun Apr 26 '13 at 15:46
  • 2
    I strongly suspect that it is deliberate, not a bug; and I would also like to see documentation/explanation for it. Now that I understand it (thanks to your explanation :) ), I sort of like the current behavior. I'll probably change my mind when I forget it and make a mistake because of it, though. To anyone who can edit: That help query can be made correct with judicious use of spaces and double-backticks: ``?`[` ``. Also, the title is missing a ")". – Frank Apr 26 '13 at 17:02
  • 2
    @Arun If it is a bug or a feature seems subjective to me. It's not the only example, where a data.table behaves different to a data.frame. – Roland Apr 26 '13 at 17:09
  • 4
    @Roland, I think you've *not* fully understood/read the post. My qualms are *not* about the differences between data.table and data.frame per-se (I just added that point following e4e5f4 and Carl's comment). My main question is about the differences *within* `data.table` **between** `dt[x != .]` and `dt[!(x==.)]` When these are seemingly equivalent operations. I've made this point bold in my question now. – Arun Apr 26 '13 at 18:11
  • 1
    Doesn't this come from the leading ! being recognized as a not join? – mnel Apr 26 '13 at 21:29
  • @mnel, yes, you're right for the 2nd case, there `notjoin = TRUE`. So, the part (x == .) gets evaluated and NA replaced with FALSE. Then, the notjoin condition is checked and since it's true, it in turn provides the opposite (which makes the NA TRUE). But does it say anything about the behaviour? I mean is it acceptable because it's not recognised as *not-join*? – Arun Apr 26 '13 at 23:08

4 Answers

7

I think it is documented and consistent behaviour.

The main thing to note is that the prefix ! within the i argument is a flag for a not-join, so x != 0 and !(x==0) are no longer the same logical operation under the documented handling of NA within data.table.

The section from the NEWS regarding the not-join:

A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
            DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
            DT[!"a"]                              # same result, now preferred.
            DT[!J(6),...]                         # !J == not-join
            DT[!2:3,...]                          # ! on all types of i
            DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach (slow)
            DT[!J(6L,23L)]                        # same result, faster binary search
        '!' has been used rather than '-' :
            * to match the 'not-join'/'not-where' nomenclature
            * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
              compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
              base R) and after this new feature.
            * to leave DT[+J...] and DT[-J...] available for future use

And from ?data.table

All types of 'i' may be prefixed with !. This signals a not-join or not-select should be performed. Throughout data.table documentation, where we refer to the type of 'i', we mean the type of 'i' after the '!', if present. See examples.


Why is it consistent with the documented handling of NA within data.table

NA values are considered FALSE. Think of it like doing isTRUE on each element.

So DT[x!=0] is indexed with TRUE FALSE NA, which becomes TRUE FALSE FALSE due to the documented NA handling.

You want to subset where things are TRUE.

This means you get the rows where x != 0 is TRUE (and not NA).

DT[!(x==0)] uses the not-join, which states that you want everything that is not 0 (which can and will include the NA values).
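One way to picture this is to treat the not-join as taking the complement of the matching row numbers. That is a mental model only, an assumption for illustration rather than how data.table implements it, and `hit` is a made-up name:

```r
x <- c(1, 0, NA)

# Rows where x == 0 is TRUE; which() ignores the NA
hit <- which(x == 0)            # 2

# Not-join pictured as "every row except the matches"
setdiff(seq_along(x), hit)      # 1 3

# Plain x != 0 with NA treated as FALSE
which(x != 0)                   # 1
```

Under this picture the NA row is kept by the complement simply because it was never among the matches.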


follow up queries / further examples

DT[!(x!=0)]

## returns
    x y
1:  0 2
2: NA 3

x!=0 is TRUE for one value, so the not-join returns everything that isn't TRUE, i.e. what was FALSE (actually == 0) or NA.

DT[!!(x==0)]

## returns
    x y
1:  0 2
2: NA 3

This is parsed as !(!(x==0)). The prefix ! denotes a not join, and the inner !(x==0) is parsed identically to x!=0, so the reasoning from the case immediately above applies.
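How R parses these expressions can be inspected with `quote()`. The sketch below wraps the check quoted in the comments (`is.call(isub) && isub[[1L]] == as.name("!")`) in a helper; `has_bang_head` is a hypothetical name, and the helper only illustrates the idea, it is not data.table's code:

```r
# Hypothetical helper: TRUE when the outermost call of the expression is `!`
has_bang_head <- function(expr) {
  is.call(expr) && identical(expr[[1L]], as.name("!"))
}

has_bang_head(quote(!(x == 0)))     # TRUE:  outermost call is `!`
has_bang_head(quote(!!(x == 0)))    # TRUE:  outer `!` wraps the inner `!`
has_bang_head(quote(x != 0))        # FALSE: outermost call is `!=`
has_bang_head(quote((!(x == 0))))   # FALSE: outermost call is `(`
has_bang_head(quote(!x & 1))        # FALSE: outermost call is `&`
```

The last two cases are the ones discussed in the comments: wrapping in parentheses, or combining with `&`, changes the outermost call even though the text still begins with `!`.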

mnel
  • mnel, thanks for the wonderful explanation. As much as I find it difficult to wrap my head around the fact that `!(x==.)` and `x != .` aren't equivalent (especially without a `J` present, I assumed a not-join is `!J(.)`), your explanation makes much sense to me. One more question. So with this behaviour, what would you expect the output for `DT[!!(x==0)]` should be and is it the same as the expected behaviour? – Arun Apr 27 '13 at 06:52
  • Basically, it'd be great if you could clarify how data.table interprets `DT[!(x!=0)]` and `DT[!!(x==0)]`. I've trouble interpreting it. – Arun Apr 27 '13 at 07:12
  • +1. Very informative. I find it useful to think of the prefix `!` as selecting the _complement_ of whatever it operates on (as "complement" is a part of math lingo, while "not-join" is a completely foreign term to me). This applies to logical vectors, vectors of indices and `J()` as well. – Frank Apr 27 '13 at 16:31
  • That's a good explanation of what's going on, but calling this either documented or consistent is really pushing the meanings of both words. Another example of interest is `DT[(!(x == 0))]` - it's pretty clear what the answer will be from @mnel's explanation, but it'll probably cause dissonance for most that the result is different from `DT[!(x == 0)]`. – eddi Apr 29 '13 at 20:21
  • 2
    How is this *not* documented? It is in the help (with the relevant section added to this answer) and in the news. Using the leading `(` to stop the not join from being triggered *is* a great way of doing so -- and perhaps could be explicitly documented as such (using `()` judiciously is becoming a [data.table idiom](http://stackoverflow.com/questions/16191083/subset-data-table-by-logical-column/16191749#comment23149196_16191749)) – mnel Apr 29 '13 at 23:08
  • @mnel - I've reread that piece of news you copy-pasted a few times and even knowing what the right answer is, I still don't think the result of any of the above examples would've been clear without going ahead and running them first. In particular, the fact that `DT[!` is THE important piece of syntax and not anything else is not made clear there at all. – eddi Apr 30 '13 at 15:32
  • In other words, yes, there is a clear correspondence from current reality to that news, but not the other way around - reading that news section will not uniquely lead one to current behavior (and it's the latter that I'd call "documented" behavior). – eddi Apr 30 '13 at 15:34
  • @eddi -- I don't understand your last comment. The extract from `?data.table` clearly state *All types of 'i' may be prefixed with !. This signals a not-join or not-select should be performed.* If you can think of a clearer way of documenting this, please put in a feature request including your suggested improvements. – mnel Apr 30 '13 at 22:51
  • @mnel, one of the problems with that description is that the ! is not attached to the `i`, but is instead attached to the bracket `[` (with optional spaces in between). The example I gave above - `DT[(!i)]` illustrated that issue - it's not about *pre*-fixing the `i` with a !, but rather *post*-fixing the `[` with a !. – eddi Apr 30 '13 at 22:55
  • @eddi that comes down to the fact you haven't understood how R parses arguments. `DT[(!i)]` gets matched positionally and parsed `DT[i = (!i)]` -- the prefix for the `i` argument is clearly `(`. – mnel Apr 30 '13 at 23:08
  • @mnel I understand what R does here quite well, thanks for inquiring. What you can't seem to grasp though is that that behavior is not the only interpretation of the documentation. Something like `(blah)` being different from `blah` is going to surprise almost anyone unless that difference is very explicitly spelled out. And yeah, while that behavior is clearly possible to implement, I can't think off the top of my head of any other non data table function that does this. – eddi May 01 '13 at 05:37
  • @eddi -- perhaps it comes down to the interpretation of *prefix* then. `.()` within `i` gets recognized in a similar manner within `[.data.table`. It is easy to forget that `(` is a function as well -- My point about parsing arguments was to highlight It isn't post fixing `[`, because you could have `DT[j = jBlah, i = !(iBlah)]` (perfectly valid usage, because `[.data.table` isn't a primitive function that only positionally matches). – mnel May 01 '13 at 06:12
  • @mnel good point, I agree, post-fixing is also a bad description of what's happening. It's more like literal string processing of 'i'. I'd have to think how to describe this best. – eddi May 01 '13 at 12:30
  • Maybe it should say just that - if literally the very first letter in 'i' is !, then ... – eddi May 01 '13 at 12:36
  • And yeah, I don't think this is same as 'prefixing' i, because that omits the important fact that this will be processed as a string to check for prefixes, rather than as an expression, since in the latter case I would argue that extra parentheses *shouldn't* matter. – eddi May 01 '13 at 12:43
  • I have confused myself now and have a question - why does this look like literal string processing of prefix? It's not under the hood, right? – eddi May 01 '13 at 12:55
  • ok, I guess that appearance is there because, as @mnel mentions, `(` is a function, but that doesn't make it any more satisfactory. Even worse is that `{` behaves the same way. Another thing one might try is assign `t = quote(!(x == 0))` and then call `DT[eval(t)]` to yet again not get a not-join. So yeah, atm I think the best way to have this in documentation is to say that the literal string `i` has to have a `!` as the first character to do a not-join. And I don't think "prefixing i" expresses that meaning. – eddi May 01 '13 at 15:58
  • @eddi: This won't address your concern with the documentation, but Arun has explained why it works this way in terms of the code: `isub = substitute(i)` followed by a check `if (is.call(isub) && isub[[1L]] == as.name("!")){notjoin = TRUE...}`. http://stackoverflow.com/questions/16221742/subsetting-a-data-table-by-a-column-without-losing-nas/16222108#16222108 This is what it means for `i` to be prefixed by `!`. – Frank May 01 '13 at 17:12
  • @Frank, thanks, so it *does* literally look at first letter of `i` - I don't think current documentation expresses that (i.e. that "this is what it means for `i` to be prefixed by `!`" is only clear after looking at the code/running a lot of different examples, but not just from documentation). On a slightly different note - is this how people actually *want* `!` to work? – eddi May 01 '13 at 17:40
  • aahhhh, no it's not, I'm going in loops around this :) looks like `substitute` parses the expression and returns a tree and `isub[[1L]]` return root of the tree, so it still only *looks* like it's literal string processing – eddi May 01 '13 at 17:46
  • @mnel and @Frank - all right, now that I understand what exactly it does underneath, here's what I think is a good example that shows how current behavior is at best undocumented: `DT[! x & 1]` - I encourage you to guess the answer before running the code, and then go ahead and interpret the result as (not)prefixing by `!`. – eddi May 01 '13 at 17:55
  • I would hope that that is treated the same as `(!x) & 1` or equivalently `x==F`, since `&` should take precedence? Ok, I see that it does: `isub<-substitute(! x & 1); as.list(isub)`. – Frank May 01 '13 at 18:02
  • @Frank - yep, and I don't know how all of the above examples combined fit "prefixed by !" - it doesn't mean literal string prefixing (as `!x&1` shows), and it doesn't mean expression prefixing (as `(!..)` or `{!...}` or `eval(quote(!...))` show) - it's some hybrid thing the meaning of which is only clear after a lot of experimentation/reading the implementation code – eddi May 01 '13 at 18:23
  • It might just be first in order of R's operations when interpreting an expression. `?Syntax` I think it's expected that, when you see `!x&y`, you know the `&` operates first...? – Frank May 01 '13 at 18:27
  • @Frank yes, that's exactly what it is and how I constructed that example - my point is to demonstrate that "prefix" is an inadequate description of what's happening. – eddi May 01 '13 at 18:29
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/29263/discussion-between-frank-and-eddi) – Frank May 01 '13 at 18:38
4

As of version 1.8.11 the ! does not trigger a not-join for logical expressions and the results for the two expressions are the same:

DT <- data.table(x=c(1,0,NA), y=1:3)
DT[x != 0]
#   x y
#1: 1 1
DT[!(x == 0)]
#   x y
#1: 1 1

A couple of other expressions mentioned in @mnel's answer also behave in a more predictable fashion now:

DT[!(x != 0)]
#   x y
#1: 0 2
DT[!!(x == 0)]
#   x y
#1: 0 2
eddi
3

I'm a month late to this discussion, but with fresh eyes and reading all the comments ... yes I reckon DT[x != .] would be better if it included any rows with NA in x in the result, and we should change it to do that.

New answer added to the linked question with further background from a different angle :

https://stackoverflow.com/a/17008872/403310

Matt Dowle
0

My view is that subset does the right thing and both data.table and data.frame don't, with data.frame doing the silliest of them all. So as far as your question goes - no, I don't think data.table should do the same thing as data.frame, it should do the same thing as subset.

For the record, here's the output of subset:

subset(DF, x != 0)
#  x y
#1 1 1
subset(DF, !(x == 0))
#  x y
#1 1 1
#
# or if you want the NA's as well
subset(DF, is.na(x) | x != 0)
#   x y
#1  1 1
#3 NA 3

I want to elaborate a little bit on why the data.frame output is silly. The very first line in the [.data.frame description says "Extract or replace subsets of data frames". The output that it returns, which has a row with rowname = NA and all of its elements equal to NA, is in no sense a "subset" of the given data frame, making the output inconsistent with the meaning of the function. It's also a huge hassle from the user's point of view, as one always has to be aware of these things and find ways to work around this behavior.

As far as data.table output goes - it's clearly inconsistent, but at least less silly, in that in both cases it actually returns subsets of the original data table.
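For anyone who wants the `subset` result while still using bracket indexing on a data.frame, one common workaround is to wrap the condition in `which()`, which drops the NA positions. This is a sketch added for illustration, not part of the output shown above:

```r
DF <- data.frame(x = c(1, 0, NA), y = 1:3)

# which() returns only the positions that are TRUE, so no all-NA row appears
DF[which(DF$x != 0), ]

# Keeping the NA rows as well needs an explicit is.na(), as with subset()
DF[is.na(DF$x) | DF$x != 0, ]
```

The first call keeps row 1 only; the second keeps rows 1 and 3.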

eddi
  • @Arun putting words into why you disagree with the argument I gave will be more useful; 2nd q: use `is.na` – eddi Apr 26 '13 at 15:37
  • @Arun `NA` is NOT not equal to 0. By definition of `NA`, asking if it's equal to anything (including itself) doesn't make sense, thus returning `NA`. – eddi Apr 26 '13 at 15:43
  • to get all entries != 0 including `NA`'s, you should write `is.na(x) | x != 0` (and this is exactly how `subset` syntax works) – eddi Apr 26 '13 at 15:47
  • 1
    `data.table` is mimicking `subset` in its handling of `NA` values in logical `i` arguments. -- the only issue is the `!` prefix signifying a not-join, not the way one might expect. Perhaps the not join prefix could have been `NJ` not `!` to avoid this confusion -- this might be another discussion to have on the mailing list -- (I think it is a discussion worth having) – mnel Apr 29 '13 at 23:12
  • @mnel - you mean use `NJ(x == 0)` instead of `!(x == 0)`? I'd be interested to see that discussion if you open it (not so interested in opening myself, as I don't yet see how `NJ` is better). – eddi Apr 30 '13 at 15:39