I'm trying to replace NA cells with some value, but only in one column. I found another thread explaining how to proceed, but I don't understand how it works.

`is.na(dt)` returns a logical matrix with the same shape as `dt`, in which each cell is TRUE or FALSE depending on whether the original cell is NA. Now, the first argument of a data.table is supposed to accept a logical vector that selects rows, not a whole matrix. And indeed `dt[is.na(dt)]` returns an error, but `dt[is.na(dt)]=0` replaces all the NA values with 0. Why does adding `=0` suddenly make this call work? Is it a special feature, or is it part of data.table's design?
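
To make the object under discussion concrete, here is a minimal sketch of what `is.na()` produces for a data.table. The three-row table is hypothetical and is not the data used in the answer below.

```r
library(data.table)

# Hypothetical table with three NA cells
dt <- data.table(a = c(1L, NA, 3L), b = c(NA, 2L, NA))

m <- is.na(dt)

is.matrix(m)   # a matrix, not a data.table
is.logical(m)  # holding TRUE/FALSE per cell
dim(m)         # same number of rows and columns as dt
sum(m)         # number of NA cells in dt
```

Because `m` is a matrix rather than a vector, it is not a valid row selector for `dt[i]`, which is what the error message complains about.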

ChiseledAbs
  • @akrun `is.na(dt)` with `dt` being a `data.table` returns a matrix `class(is.na(dt)) : matrix` tracing the original datatable whose cells have been replaced with booleans. – ChiseledAbs Dec 09 '16 at 02:24
  • I get error in both cases i.e. `setDT(dt)[is.na(dt)] <- 0# Error in setDT(dt)[is.na(dt)] <- 0 : could not find function "setDT<-"` using v.1.10.0 – akrun Dec 09 '16 at 02:26
  • Relevant - http://stackoverflow.com/questions/20535505/replacing-all-missing-values-in-r-data-table-with-a-value – thelatemail Dec 09 '16 at 03:24

1 Answer

The expression would work if `dt` were a data.frame:

dt[is.na(dt)]
#[1] NA NA NA NA NA

In a data.table, however, the syntax is different: `i` cannot be a matrix, and converting to a logical matrix is inefficient and not recommended (tested with v1.10.0):

setDT(dt)[is.na(dt)]

Error in `[.data.table`(setDT(dt), is.na(dt)) : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let datatable-help know if you'd like this, or add your

A better option is `set()`, which replaces values in place without copying:

for(j in seq_along(dt)) {
  set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}   

dt
#    a b c
# 1: 1 0 2
# 2: 2 2 2
# 3: 2 1 1
# 4: 2 0 1
# 5: 0 1 2
# 6: 2 0 5
# 7: 1 1 4
# 8: 1 1 0
# 9: 2 1 5
#10: 2 1 1
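
Since the question asks about a single column, the loop above can be reduced to one `set()` call. This is a minimal sketch on a hypothetical three-row table, with `"b"` standing in for whichever column needs fixing.

```r
library(data.table)

# Hypothetical table; only column b should have its NAs replaced
dt <- data.table(a = c(1L, NA, 3L), b = c(NA, 2L, NA))

# Same call as in the loop, but with j fixed to a single column
set(dt, i = which(is.na(dt[["b"]])), j = "b", value = 0L)

dt
```

Column `a` keeps its NA because only the rows and the column passed to `set()` are touched.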

Another version, which returns a new data.table instead of modifying `dt` in place, is:

setDT(dt)[, lapply(.SD, function(x) replace(x, is.na(x), 0))]
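
The `lapply(.SD, ...)` call returns a modified copy. As a hedged sketch, the same replacement can be written back by reference with `:=`, restricted to chosen columns through `.SDcols`, or to a single column with a filter in `i`. The table and the column names `x`, `y`, `z` are hypothetical.

```r
library(data.table)

# Hypothetical table with NAs in every column
dt <- data.table(x = c(1L, NA, 3L), y = c(NA, 2L, NA), z = c(NA, NA, 5L))

# Several columns at once: apply replace() to the columns named in .SDcols
cols <- c("x", "y")
dt[, (cols) := lapply(.SD, function(v) replace(v, is.na(v), 0L)), .SDcols = cols]

# Column z still has its NAs at this point
z_na_before <- sum(is.na(dt$z))

# A single column: select the NA rows in i, assign in j
dt[is.na(z), z := 0L]
```

Both forms update `dt` by reference, so no assignment with `<-` is needed.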

data

dt <- structure(list(a = c(1L, 2L, 2L, 2L, NA, 2L, 1L, 1L, 2L, 2L), 
b = c(NA, 2L, 1L, NA, 1L, NA, 1L, 1L, 1L, 1L), c = c(2L, 
2L, 1L, 1L, 2L, 5L, 4L, NA, 5L, 1L)), .Names = c("a", "b", 
"c"), class = "data.frame", row.names = c(NA, -10L))
akrun
  • But why does `dt[i, j]` accept a matrix as `i` when `=0` is added? Is this feature documented somewhere? Usually `i` is a logical vector that selects specific rows in `dt` – ChiseledAbs Dec 09 '16 at 02:31
  • @ChiseledAbs The `is.na(dt)` is a logical matrix and not a vector – akrun Dec 09 '16 at 02:33
  • @ChiseledAbs It just extracts a single value and not multiple values by what you showed `dt[cbind(c(1,3), c(1,2))] #Error in `[.data.table`(dt, cbind(c(1, 3), c(1, 2))) : i is invalid type (matrix). Perhaps in future a 2 column m` – akrun Dec 09 '16 at 02:38
  • Exactly, that's my point: `dt[cbind(c(1,3), c(1,2))]` yields an error but `dt[cbind(c(1,3), c(1,2))]=0` doesn't. I'm wondering how this works. – ChiseledAbs Dec 09 '16 at 03:13
  • Your second example fails only because there is no `setDT<-` function. It works just fine if you don't try to do it all in one step: `setDT(dt); dt[is.na(dt)] <- 0`. Not that I recommend doing this, as per the question I linked above. – thelatemail Dec 09 '16 at 03:26
  • I believe the second code ends up dispatching to `` `[<-.data.frame` ``, as you can see by running ``debug(`[<-.data.frame`)`` before trying the assignment operation. Presumably because `data.table`s are also `data.frame`s – thelatemail Dec 09 '16 at 03:35
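
The point made in these comments can be checked with a short sketch. It assumes, as thelatemail suggests, that extraction (`[`) and assignment (`[<-`) are separate functions, and that a data.table also carries the `data.frame` class. The table is hypothetical, and the exact dispatch path may differ between data.table versions.

```r
library(data.table)

# Hypothetical table with NAs
dt <- data.table(a = c(1L, NA, 3L), b = c(NA, 2L, NA))

# A data.table is also a data.frame, so data.frame methods are available to it
is_df <- inherits(dt, "data.frame")

# Extraction with a matrix i errored in the v1.10.0 session shown above;
# tryCatch lets this sketch run whether or not it still does
extract_result <- tryCatch(dt[is.na(dt)], error = function(e) conditionMessage(e))

# Assignment goes through `[<-`, a different function from `[`
dt[is.na(dt)] <- 0

any(is.na(dt))
```

This explains why the assignment form works, but the `set()` and `:=` approaches remain preferable because they avoid building the logical matrix and copying the table.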