
I am trying to replace NAs in columns that are neither numeric nor logical, using the following code:

library(data.table)

test_dt <- data.table(a = c("foo", "bar", "foo_bar"),
                      b = c(1.243, NA, 78454),
                      c = c(NA, NA, NA),
                      d = c(1.242345235, 2.3453255635, 475.253552352),
                      e = as.POSIXlt(c(NA, rep(Sys.time(), 2)), origin = as.POSIXlt(Sys.time(), "GMT"), tz = "GMT"),
                      f = c(TRUE, FALSE, NA),
                      g = as.Date(c(Sys.Date(), Sys.Date() - 5, NA)))

replaceNABlank <- function(DT, cols) {
  for (j in cols)
    set(DT,which(is.na(DT[[j]])) ,j, '')
  print(DT)
}

to_quote <- names(test_dt)[!(sapply(test_dt, class) %in% c('logical', 'numeric', 'integer'))]
options(useFancyQuotes = FALSE)

test_dt <- test_dt[, (to_quote) := lapply(.SD, as.character), .SDcols = to_quote]
test_dt1 <- replaceNABlank(test_dt, to_quote)

Sample data is provided in the code.

The print(DT) call inside the function prints the expected output, but test_dt1 is NULL. I tried to adapt the solution from "Fastest way to replace NAs in a large data.table" to my case, but it doesn't seem to work. Any explanation?

abhiieor

1 Answer


I believe the issue is the return value of your function. Its last expression is print(DT), so the function returns whatever print returns (NULL, as you observed) rather than the table. If you want to assign the result, end the function with DT instead. One option is to change the function to:

replaceNABlank <- function(DT, cols) {
  for (j in cols)
    set(DT,which(is.na(DT[[j]])) ,j, '')
  DT
}
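
As a variation, and only as a sketch rather than part of the fix above, the function could return the table invisibly, which is also how set itself hands back its result. Calling it at the top level then prints nothing, while an assignment still receives the table. The two-column table below is made-up toy data, not the OP's test_dt.

```r
library(data.table)

# Hypothetical toy data, used only for illustration
DT <- data.table(a = c("foo", NA, "bar"), b = c(1, NA, 3))

replaceNABlank <- function(DT, cols) {
  for (j in cols)
    set(DT, which(is.na(DT[[j]])), j, "")  # overwrite NA rows of column j in place
  invisible(DT)                             # hand the table back without auto-printing
}

res <- replaceNABlank(DT, "a")  # only column "a" is touched
res$a                           # NA in "a" replaced by an empty string
DT$b                            # column "b" still contains its NA
```

Whether to return DT visibly or invisibly is a matter of taste; either way the caller gets the table instead of NULL.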

However, since data.table::set updates columns by reference, you don't need to assign the result at all. You could simply call the function and then inspect test_dt:

test_dt[, (to_quote) := lapply(.SD, as.character), .SDcols = to_quote]
replaceNABlank(test_dt, to_quote)

test_dt
#         a         b  c          d                   e     f          g
#1:     foo     1.243 NA   1.242345                      TRUE 2018-05-09
#2:     bar        NA NA   2.345326 2066-09-15 06:43:38 FALSE 2018-05-04
#3: foo_bar 78454.000 NA 475.253552 2066-09-15 06:43:38    NA  
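
To make the by-reference behaviour concrete, here is a minimal sketch on made-up toy tables (the names DT, DT2, orig and mod are illustrative only). It assumes nothing beyond data.table's set, address and copy: assigning the result of set just creates a second name for the same table, and copy is the way to keep an untouched original.

```r
library(data.table)

# Hypothetical toy table
DT  <- data.table(a = c("x", NA))
DT2 <- set(DT, which(is.na(DT$a)), "a", "")  # set() returns DT itself, invisibly

identical(address(DT), address(DT2))  # both names point at the same table
DT$a                                  # DT was modified in place

# To keep the original intact, modify an explicit copy instead
orig <- data.table(a = c("x", NA))
mod  <- copy(orig)
set(mod, which(is.na(mod$a)), "a", "")
orig$a                                # still contains NA
mod$a                                 # NA replaced by ""
```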
Mike H.
  • @Frank you're right, it doesn't do anything when you're assigning a `print` return value. They could also change the function to return `DT` which would work with `<-`. I should really update my answer... – Mike H. May 09 '18 at 15:30
  • Ok. Even with `DT` on the last line, there is no reason to use `<-` on the last line. `DT = data.table(a = 1:2); DT2 <- set(DT, 2L, "a", 3L)` just leaves DT and DT2 referring to the same table `address(DT); address(DT2)` https://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another – Frank May 09 '18 at 15:43
  • @Frank, I might be misunderstanding you, but are you saying that you can change the function to remove the `DT` and then the OP's original code should work? I realize that for the second method there is no need to return `DT` from the function, but I thought it would be too cluttered to include a small variation of the function – Mike H. May 09 '18 at 16:05
  • I was just saying that I read "which would work with `<-`" as suggesting that `<-` makes sense after a modify-by-reference function (like `replaceNABlank`), but I don't think it does. Nah, I think it's good to have DT there / return it as you do, and consistent with `set`, `:=`, etc. – Frank May 09 '18 at 16:10
  • Ah yea you're right that was a dumb suggestion by me. Thanks for the help btw! – Mike H. May 09 '18 at 16:12