Select NA in a data.table in R

Question

How do I select all the rows that have a missing value in the primary key in a data table.

DT = data.table(x=rep(c("a","b",NA),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)

Selecting for a particular value is easy

DT["a",]

Selecting for the missing values seems to require a vector search. One cannot use binary search. Am I correct?

DT[NA,]# does not work
DT[is.na(x),] #does work

Josh O'Brien · Accepted Answer · 2015-02-22T21:07:05.603

Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice, this may not really matter much:

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)  

benchmark(DT["a",],
          DT[is.na(x),],
          replications=20)
#             test replications elapsed relative user.self sys.self user.child
# 1      DT["a", ]           20    9.18    1.000      7.31     1.83         NA
# 2 DT[is.na(x), ]           20   10.55    1.149      8.69     1.85         NA

===

Addition from Matthew (won't fit in comment) :

The data above has 3 very large groups, though. So the speed advantage of binary search is dominated here by the time to create the large subset (1/3 of the data is copied).

benchmark(DT["a",],  # repeat select of large subset on my netbook
    DT[is.na(x),],
    replications=3)
          test replications elapsed relative user.self sys.self
     DT["a", ]            3   2.406    1.000     2.357    0.044
DT[is.na(x), ]            3   3.876    1.611     3.812    0.056

benchmark(DT["a",which=TRUE],   # isolate search time
    DT[is.na(x),which=TRUE],
    replications=3)
                      test replications elapsed relative user.self sys.self
     DT["a", which = TRUE]            3   0.492    1.000     0.492    0.000
DT[is.na(x), which = TRUE]            3   2.941    5.978     2.932    0.004

As the size of the subset returned decreases (e.g. adding more groups), the difference becomes apparent. Vector scans on a single column aren't too bad, but on 2 or more columns it quickly degrades.

Maybe NAs should be joinable to. I seem to remember a gotcha with that, though. Here's some history linked from FR#1043 Allow or disallow NA in keys?. It mentions there that NA_integer_ is internally a negative integer. That trips up radix/counting sort (iirc) resulting in setkey going slower. But it's on the list to revisit.

@Matthew -- Using `which=TRUE` to isolate the search time is neat idea. Also, `DT[is.na(x),]` still seems to beat anything else I've been able to come up with. I initially tried `DT[-DT[na.omit(unique(DT[,1,with=FALSE])), which=TRUE],]`, but it's as *sloow* as you might guess. Not that it's the problem in the above, but is `unique()` really the fastest way to extract the unique levels of a keyed column (or set of keyed columns)? — Josh O'Brien, Sep 28 '12 at 21:30
Hi Josh. I had quick look. Seems that `[,1,with=FALSE]` retains key ok, so `unique.data.table` should call `duplist` which works like `which(diff!=0)` but on lists of columns. So yes the unique part should be as fast as a vector scan because it knows it's sorted. But because all but one of the groups are being joined to, it might be hitting a known (to me) wrinkle. Search [data.table.R](https://r-forge.r-project.org/scm/viewvc.php/pkg/R/data.table.R?view=markup&root=datatable) for `mapply` and the comment there. I guess that's biting in this case. Hoping to replace with `bit::vecseq` asap! — Matt Dowle, Sep 29 '12 at 20:12
Also, that would benefit from column references, which isn't yet implemented. Currently, `DT[,j,with=FALSE]` takes a copy of the j columns into the new table. A shallow copy would be instant, but it needs to mark them, so a copy-on-change at column level can be done if `:=` is used later. — Matt Dowle, Sep 29 '12 at 20:19
Josh, I've finally zapped the horrid internal call to `mapply`. In the end `bit::vecseq` was implemented in R, so I've done that in C, and use that internally. The slowdown of many minutes is down to a few seconds now as it should be. Workaround of `mult="first"` no longer needed. In v1.8.3 on R-Forge. Well, I hope. At least in my tests. — Matt Dowle, Oct 29 '12 at 22:35

Arun · Answer 2 · 2013-12-15T01:24:57.807

20

This is now implemented in v1.8.11. From NEWS:

o Binary search is now capable of subsetting NA/NaNs and also perform joins and merges by matching NAs/NaNs.

Although you'll have to provide the correct NA (NA_real_, NA_character_ etc..) explicitly at the moment.

On OP's data:

DT[J(NA_character_)] # or for characters simply DT[NA_character_]
#     x y v
# 1: NA 1 7
# 2: NA 3 8
# 3: NA 6 9

Also, here's the same benchmark from @JoshOBrien's post, with this binary search for NA added:

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)  

benchmark(DT["a",],
          DT[is.na(x),],
          DT[NA_character_], 
          replications=20)

            test replications elapsed relative user.self sys.self
1      DT["a", ]           20   4.763    1.238     4.000    0.567
2 DT[is.na(x), ]           20   5.399    1.403     4.537    0.794
3         DT[NA]           20   3.847    1.000     3.215    0.600 # <~~~

edited Dec 15 '13 at 01:24

answered Dec 15 '13 at 01:06

Arun

116,683
26
284
387

1

+1 Great! In the very last line it looks like `DT[NA]` now does a join? – Matt Dowle Dec 15 '13 at 03:37
2

@MattDowle, no `rbenchmark` seems to change it to `NA` from `NA_character_` to display for some reason. – Arun Dec 15 '13 at 08:07
Nice, but you can't get 1.8.11 from CRAN, you need R-Forge, and their binary building is currently broken: http://stackoverflow.com/questions/18772277/installing-new-version-of-data-table-specifically-1-8-11-from-rforge CRAN just gives 1.8.10 – smci Feb 25 '14 at 06:56
nice (+1). How do you filter not NA. For example: `DT[J(!NA_real_)]` did not work. – Andre Silva Aug 04 '14 at 13:40
3

@AndreSilva, try `DT[!J(NA_real_)]`. – Arun Aug 04 '14 at 16:15
How do you use this in a column that is not the key? – Herman Toothrot Dec 06 '18 at 14:59

Select NA in a data.table in R

2 Answers2

This is now implemented in v1.8.11. From NEWS:

Linked

Related