5

I have a column in a data table with NaN values. Something like:

my.dt <- data.table(x = c(NaN, NaN, NaN, .1, .2, .2, .3), y = c(2, 4, 6, 8, 10, 12, 14))
setkey(my.dt, x)

I can use the J() function to find all instances where the x column is equal to .2

> my.dt[J(.2)]

     x  y
1: 0.2 10
2: 0.2 12

But if I try to do the same thing with NaN it doesn't work.

> my.dt[J(NaN)]

     x  y
1: NaN NA

I would expect:

     x  y
1: NaN  2
2: NaN  4
3: NaN  6

What gives? I can't find anything in the data.table documentation to explain why this is happening (although it may just be that I don't know what to look for). Is there any way to get what I want? Ultimately, I'd like to replace all of the NaN values with zero, using something like my.dt[J(NaN), x := 0]

Wilduck
  • 13,822
  • 10
  • 58
  • 90
  • 1
    +1. Very strange. When I run `my.dt[J(NaN), x := 0]` it destroys the key but has no effect on the values in the data.table... To get what you want (sort of), use `my.dt[!J(unique(x)),x:=0]`. Don't ask me why it works, though! – Frank Oct 08 '13 at 02:13
  • @Frank I'm seeing the same result/effect (or lack thereof). – Wilduck Oct 08 '13 at 02:17
  • 3
    You could just do `my.dt[is.nan(x),x:=0]`. – mrip Oct 08 '13 at 02:19
  • 1
    @mrip Yeah. I guess the goal is to use the radix sort of whatever speedy magic is invoked by `J`, and to have consistency. But what you propose is what I would do before setting the key. One could argue that NA and NaN are not valid values for a key to take... – Frank Oct 08 '13 at 02:21
  • Yeah, you wouldn't get the speedy magic with `is.nan`. – mrip Oct 08 '13 at 02:22
  • @mrip you're right, that'll work. If you want to take the time to turn that into an answer with a little more explanation then if nothing more complete comes along by tomorrow morning, I'll accept it. – Wilduck Oct 08 '13 at 02:24
  • Well, I don't think it's worthy of being an answer because, as @Frank pointed out, it doesn't make use of the data table "speedy magic" that comes from indexing. But it will get the job done, and if you're only going to do it once, it shouldn't be too bad. – mrip Oct 08 '13 at 02:27
  • @mrip Actually, I think overwriting those values with `is.nan` *before* setting the key is the best course, so you could write it up. Whenever you edit a key column, the key needs to be reset anyways, which must also be costly, I guess: `DT <- data.table(x=1:2,y=3:4,key='x');DT[.(1L),x:=0L];key(DT)` – Frank Oct 08 '13 at 02:32
  • This is related to http://stackoverflow.com/questions/12646172/select-na-in-a-data-table-in-r - Note that a single vector scan (is.nan) is comparable to the binary join. @frank see ttps://r-forge.r-project.org/tracker/index.php?func=detail&aid=1043&group_id=240&atid=978 – mnel Oct 08 '13 at 03:29
  • @mnel Thanks for the references; I didn't know anything about this. For some reason, copy-pasting that link didn't work for me so [here's another version](http://r-forge.r-project.org/tracker/index.php?func=detail&aid=1043&group_id=240&atid=978) – Frank Oct 08 '13 at 03:38
  • possible duplicate of [data.table inner/outer join with NA in join column of type double bug?](http://stackoverflow.com/questions/14076065/data-table-inner-outer-join-with-na-in-join-column-of-type-double-bug) or http://stackoverflow.com/questions/13521843/data-table-join-na-keys or http://stackoverflow.com/questions/17508127/na-in-i-expression-of-data-table-possible-bug – Ricardo Saporta Oct 08 '13 at 04:53
  • @Wilduck, this has been fixed a while back, in 1.9.2. I've updated the post with relevant info. Could you please verify and accept the answer, so that it remains closed? Thank you! – Arun Sep 07 '14 at 10:04
  • 1
    @Arun Thanks for the update, and for keeping the tag clean. I've accepted the updated answer. – Wilduck Sep 08 '14 at 14:17

3 Answers3

3

Update: This has been fixed a while back, in v1.9.2. From NEWS:

NA, NaN, +Inf and -Inf are now considered distinct values, may be in keys, can be joined to and can be grouped. data.table defines: NA < NaN < -Inf. Thanks to Martin Liberts for the suggestions, #4684, #4815 and #4883.

require(data.table) ## 1.9.2+
my.dt[J(NaN)]
#      x  y
# 1: NaN  2
# 2: NaN  4
# 3: NaN  6

This issue is part design choice, part bug. There are several questions on SO and a few emails on the listserv exploring NA's in data.table key.

The main idea is outlined in the FAQ in that NA's are treated as FALSE

Please feel free chime in on the conversation in the mailing list. There was a conversation started by @Arun,

http://r.789695.n4.nabble.com/Follow-up-on-subsetting-data-table-with-NAs-td4669097.html

Also, you can read more in the answers and comments to any of the following questions on SO:

subsetting a data.table using !=<some non-NA> excludes NA too
NA in `i` expression of data.table (possible bug)
DT[!(x == .)] and DT[x != .] treat NA in x inconsistently


In the meantime, your best bet is to use is.na.
While it is slower than a radix search, it is still faster than most vector searches in R, and certainly much, much faster than any fancy workarounds

library(microbenchmark)
microbenchmark(my.dt[.(1)], my.dt[is.na(ID)], my.dt[ID==1], my.dt[!!!(ID)])
# Unit: milliseconds
               expr    median 
        my.dt[.(1)]  1.309948 
   my.dt[is.na(ID)]  3.444689   <~~ Not bad
     my.dt[ID == 1]  4.005093 
 my.dt[!(!(!(ID)))] 10.038134 

### using the following for my.dt
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)
Community
  • 1
  • 1
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • following one of the links I saw Matt use `which = TRUE` to isolate search times, and that makes the runtime difference between `J(1)` and `is.na(ID)` a bit larger – eddi Oct 08 '13 at 15:41
  • @eddi, why would you want the run time to be larger? – Ricardo Saporta Oct 08 '13 at 15:59
  • the *difference* becomes larger - my point was simply that there are other things happening here besides the search and `which = TRUE` isolates them better (and makes `is.na` look a bit worse) – eddi Oct 08 '13 at 16:13
3

Here's a fast workaround that relies a lot on what's actually happening internally (making the code a bit fragile imo). Because internally NaN is just a very very negative number, it will always be at the front of your data.table when you setkey. We can use that property to isolate those entries like so:

# this will give the index of the first element that is *not* NaN
my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T]

# this is equivalent to my.dt[!is.nan(x)], but much faster
my.dt[seq_len(my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T] - 1)]

Here's a benchmark for Ricardo's sample data:

my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)

# NOTE: I have to use integer max here - because this example has integers
# instead of doubles, so I'll just add simple helper function (that would
# likely need to be extended for other cases, but I'm just dealing with the ones here)
minN = function(x) if (is.integer(x)) -.Machine$integer.max else -.Machine$double.xmax

library(microbenchmark)
microbenchmark(normalJ = my.dt[J(1)],
               naJ = my.dt[seq_len(my.dt[J(minN(ID)), roll = -Inf, which = T] - 1)])
#Unit: milliseconds
#    expr      min       lq   median       uq       max neval
# normalJ 1.645442 1.864812 2.120577 2.863497  5.431828   100
#     naJ 1.465806 1.689350 2.030425 2.600720 10.436934   100

In my tests the following minN function also covers character and logical vectors:

minN = function(x) {
  if (is.integer(x)) {
    -.Machine$integer.max
  } else if (is.numeric(x)) {
    -.Machine$double.xmax
  } else if (is.character(x)) {
    ""
  } else if (is.logical(x)) {
    FALSE
  } else {
    NA
  }
}

And you will want to add mult = 'first', e.g.:

my.dt[seq_len(my.dt[J(minN(colname)), roll = -Inf, which = T, mult = 'first'] - 1)]
eddi
  • 49,088
  • 6
  • 104
  • 155
0

See if this is helpful.

my.dt[!is.finite(x),]
     x y
1: NaN 2
2: NaN 4
3: NaN 6
IRTFM
  • 258,963
  • 21
  • 364
  • 487