23

If you have an R data.table that has missing values, how do you replace all of them with say, the value 0? E.g.

aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,4,4))
bb = data.table(V1=3:6,X=letters[1:4])
setkey(aa,V1)
setkey(bb,V1)
tt = bb[aa]

    V1  X V2
 1:  1 NA  1
 2:  2 NA  2
 3:  3  a  2
 4:  4  b  3
 5:  5  c  3
 6:  6  d  3
 7:  7 NA  4
 8:  8 NA  4
 9:  9 NA  4
10: 10 NA  4

Any way to do this in one line? If it were just a matrix, you could just do:

tt[is.na(tt)] = 0
FBC
  • 993
  • 3
  • 9
  • 15

4 Answers4

31

is.na (being a primitive) has relatively very less overhead and is usually quite fast. So, you can just loop through the columns and use set to replace NA with0`.

Using <- to assign will result in a copy of all the columns and this is not the idiomatic way using data.table.

First I'll illustrate as to how to do it and then show how slow this can get on huge data (due to the copy):

One way to do this efficiently:

for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)

You'll get a warning here that "0" is being coerced to character to match the type of column. You can ignore it.

Why shouldn't you use <- here:

# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
for (i in seq_along(tt)) 
    set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
#   user  system elapsed 
#  0.284   0.083   0.386 

# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
#   user  system elapsed 
#  4.110   0.976   5.187 
Arun
  • 116,683
  • 26
  • 284
  • 387
  • 1
    Great answer. Not strictly required, but any way to do this without a for loop, e.g. using lapply? Or does that make copies of the columns too? – FBC Dec 12 '13 at 18:50
  • I'm not getting anywhere near the time increase that you are seeing, regardless of whether I use `tracemem(tt)` before each run the respective elapsed times are about 0.34 and 0.42. `<-` is slower, but nowhere near as slow as your benchmarks. – thelatemail Dec 12 '13 at 22:57
  • @thelatemail, I just checked again. It's 0.386 vs 5.05 sec. – Arun Dec 12 '13 at 23:09
  • did you generate tt again before to run system.time? The first one modifies it by reference... – Arun Dec 12 '13 at 23:11
  • I've been using `tt2 <- tt <- data.table(...` and running the first `for` code block on `tt` and the second `<-` code on `tt2` – thelatemail Dec 12 '13 at 23:22
  • Nope, won't work. The first line assignment won't make a copy. And the for-loop does things by reference on tt, so won't make a copy as well. But `tt2` points to the same memory and therefore will be changed by reference as well with all NAs replaced. – Arun Dec 12 '13 at 23:24
  • You have to regenerate tt from scratch before comparing the performance. I also observe similar speedups with Arun's answer. – FBC Dec 13 '13 at 00:47
  • Thanks @Arun! This solution was brilliant! The best one I found so far that worked without throwing an error on a date field. – Alfredo G Marquez Aug 16 '17 at 21:25
  • @Arun The `data.table` help file states ":= is more powerful and flexible than `set`" which hasn't been my experience. Do you know whether for this particular example it is possible to use := instead? – AdamO Feb 06 '18 at 15:00
15

Nothing unusual here:

tt[is.na(tt)] = 0

..will work.

This is somewhat confusing however given that:

tt[is.na(tt)]

...currently returns:

Error in [.data.table(tt, is.na(tt)) : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let datatable-help know if you'd like this, or add your comments to FR #1611.

thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • Yes, I think what happened was that I tried it and got that error and was confused, didn't think of actually trying to set it to zero directly. Thanks! – FBC Dec 12 '13 at 05:58
  • 2
    this'd not be the idiomatic way as this'll make a copy of all the columns. – Arun Dec 12 '13 at 13:51
  • 1
    perfect. I stopped when I saw that error. Whodathunk an assignment would work after that... – hedgedandlevered Jan 27 '16 at 22:26
  • I gather it is because data.table specific code is called when trying to do the selection, but general data.frame code is used when doing the replacement operation. Confusing, but this is a particularly fringe example. – thelatemail Mar 15 '17 at 22:08
0

I would make use of data.table and lapply, namely:

tt[,lapply(.SD,function(kkk) ifelse(is.na(kkk),-666,kkk)),.SDcols=names(tt)]

yielding in:

V1    X V2
 1:  1 -666  1
 2:  2 -666  2
 3:  3    a  2
 4:  4    b  3
 5:  5    c  3
 6:  6    d  3
 7:  7 -666  4
 8:  8 -666  4
 9:  9 -666  4
10: 10 -666  4
amonk
  • 1,769
  • 2
  • 18
  • 27
  • 1
    That is quite slow because `ifelse()` updates all values. Using [Arun's benchmark code](https://stackoverflow.com/a/20545629/3817004), it is 15 times slower than the loop using `set()`. – Uwe Feb 12 '18 at 19:16
0

The specific problem OP is posting could also be solved by

tt[is.na(X), X := 0]
altabq
  • 1,322
  • 1
  • 20
  • 33