When i contains NA, the corresponding row is not returned. Is this the intended behavior? The results below look inconsistent once i is negated:

require(data.table)
x = data.table(a=c(NA, 1:3, NA))    
x[a>0]       
   a
1: 1
2: 2
3: 3

x[!(a>0)]
    a
1: NA
2: NA

x[a<0]   
Empty data.table (0 rows) of 1 col: a

x[!(a<0)]
    a
1: NA
2:  1
3:  2
4:  3
5: NA

 > sessionInfo()
 R version 2.15.2 (2012-10-26)
 Platform: x86_64-unknown-linux-gnu (64-bit)

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
  [7] LC_PAPER=C                 LC_NAME=C                 
  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
  [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

 attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

 other attached packages:
  [1] data.table_1.8.8
Alex
  • Interesting... May I suggest you use `a=c(NA, 1:3, NA)` in your example? Having such a long example is not useful or relevant. – flodel Jul 07 '13 at 00:11
  • Put differently, `x[as.logical(a)]` and `x[!!as.logical(a)]` do not give the same result. Tempted to call it a bug. In that case, I'd highly suggest you make sure you are using the latest version of the package and add that information (version number) into your question. – flodel Jul 07 '13 at 00:21
  • `x[as.logical(!a<0)]` seems to remove the NA values... – agstudy Jul 07 '13 at 00:22
  • just updated with smaller example and some other results – Alex Jul 07 '13 at 01:46
  • I think this is more relevant as well: http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently – Arun Jul 07 '13 at 01:47
  • @Arun - much better dup example. I had missed that Q previously – Ricardo Saporta Jul 07 '13 at 01:53

2 Answers

As @flodel points out, the question can be simplified to: why is this not TRUE?

identical(x[as.logical(a)], x[!!as.logical(a)])   # note the double bangs

The answer lies in how data.table handles NA in i and how it handles ! in i, both of which receive special treatment. The problem arises from the combination of the two.

  • NAs in i are treated as FALSE.
  • A leading ! in i is treated as a negation (a not-select).

This is well documented in ?data.table (as G. Grothendieck points out in another answer). The relevant portions are:

integer and logical vectors work the same way they do in [.data.frame. Other than NAs in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame.
...
All types of 'i' may be prefixed with !. This signals a not-join or not-select should be performed. Throughout data.table documentation, where we refer to the type of 'i', we mean the type of 'i' after the '!', if present.
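For comparison, here is a small base R sketch of the [.data.frame behavior that the first quoted passage contrasts against. It uses only base R; the names df and keep are made up for illustration.

```r
# Base R only: how [.data.frame and subset() treat NA in a logical index.
df <- data.frame(a = c(NA, 1:3, NA))
keep <- df$a > 0                        # NA TRUE TRUE TRUE NA

n_bracket  <- nrow(df[keep, , drop = FALSE])  # NA entries come back as all-NA rows
n_subset   <- nrow(subset(df, keep))          # subset() drops rows where the condition is NA
n_recycled <- nrow(df[NA, , drop = FALSE])    # a single logical NA is recycled to every row

print(c(bracket = n_bracket, subset = n_subset, recycled = n_recycled))
```

According to the quoted documentation, [.data.table differs on both points: NAs in a logical i count as FALSE, and a single logical NA is not recycled.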

If you look at the code for [.data.table, a leading !, if present, is handled by:

  1. Removing the leading !
  2. Interpreting the remaining i
  3. Negating that interpretation

NAs are handled by setting those values to FALSE.
Importantly, however, this happens within step 2 above, before the negation in step 3.

Thus, when i contains NA and i is prefixed by !, the NAs are effectively interpreted as TRUE: they are first set to FALSE in step 2 and then flipped by the negation in step 3. While this is technically as documented, I am not sure it is as intended.
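The three steps can be mimicked in base R. This is only a sketch of the logic described above, not data.table's actual code; select_rows and its negate argument are hypothetical names, with negate = TRUE standing in for a leading ! on i.

```r
# Hypothetical helper (not part of data.table): returns the row numbers selected.
select_rows <- function(i, negate = FALSE) {
  i[is.na(i)] <- FALSE      # step 2: NA in the logical i is treated as FALSE
  if (negate) i <- !i       # step 3: the stripped leading ! is applied afterwards
  which(i)
}

a <- c(NA, 1:3, NA)
select_rows(a > 0)                  # rows 2 3 4
select_rows(a > 0, negate = TRUE)   # rows 1 5, i.e. the two NA rows
select_rows(a < 0)                  # no rows
select_rows(a < 0, negate = TRUE)   # rows 1 2 3 4 5

# Plain R negation keeps NA as NA, so no NA row is selected:
which(!(a > 0))                     # no rows
```

The row numbers match the four results shown in the question for data.table 1.8.8.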


Of course, there is the final question of @flodel's point: Why is x[as.logical(a)] not the same as x[!!as.logical(a)]? The reason for this is that only the first bang gets special treatment. The second bang is interpreted as normal by R.

Since !NA is still NA, the sequence of modifications in the interpretation of !!(NA) is:

!!(NA)  
!( !(NA) )  
!(  NA   )
!( FALSE )
TRUE
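The same trace can be written out for the whole vector in base R. Again, this only emulates the sequence described above under the assumption that the outer bang is stripped and applied last; the variable names are made up for illustration.

```r
i <- as.logical(c(NA, 1:3, NA))   # NA TRUE TRUE TRUE NA

# x[as.logical(a)]: no bang, NA simply becomes FALSE
plain <- i
plain[is.na(plain)] <- FALSE
rows_plain <- which(plain)        # 2 3 4

# x[!!as.logical(a)]: the inner bang is ordinary R, the outer bang is special
inner <- !i                       # NA FALSE FALSE FALSE NA (!NA is still NA)
inner[is.na(inner)] <- FALSE      # NA -> FALSE while interpreting i
rows_double <- which(!inner)      # outer bang applied last: 1 2 3 4 5

identical(rows_plain, rows_double)  # FALSE
```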
Ricardo Saporta

This is documented behavior. See the description of the i argument in ?data.table.

G. Grothendieck
  • Gabor, I would argue that what is documented may be slightly different from what we are seeing here. Specifically, I think @flodel's comment is very relevant. Rather, the "NA's interpreted to be FALSE" should occur after the `!`. ie that `!NA` should be treated the same as `NA` – Ricardo Saporta Jul 07 '13 at 01:00
  • Agreed. To throw `NA` away or not is a documented design choice. `[.data.frame` keeps them while both `[.data.table` and `subset` don't. Nothing wrong with that. The problem is that `data.table` seems to have a serious issue enforcing consistent logical rules. I find normal/intuitive that `subset(x, as.logical(a))` and `subset(x, !!as.logical(a))` return the same result. However, in the case of `[.data.table`, `x[as.logical(a)]` and `x[!!as.logical(a)]` do not... – flodel Jul 07 '13 at 01:51
  • Also the feature request and last paragraph of @MatthewDowle's answer here: http://stackoverflow.com/questions/16221742/subsetting-a-data-table-using-some-non-na-excludes-na-too/17008872#17008872 make me really uncomfortable. Most of it seems to rely on the wrong (or new?) concept that `!NA` should be `TRUE`. – flodel Jul 07 '13 at 01:52
  • @flodel, you can see the follow-up (extensive) discussion of the post you've linked above [**here**](http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-June/001856.html). IIUC, we should be expecting some changes. – Arun Jul 07 '13 at 02:06
  • @Arun. It's spot on. Thanks for having brought it up to the author's attention. – flodel Jul 07 '13 at 02:23