1

I have a large list of data.tables (~10m rows in total) that contain a number of string variants of NA, such as "N/A" or "Unknown". I would like to replace these observations with a missing value, for all of the columns in all of my data.tables.

A simplified example of the data is set out below:

library(data.table)
dt1 <- data.table(v1 = 1:4, v2 = c("yes", "no", "unknown", NA))
dt2 <- data.table(v1 = c("1", "2", "not applicable", "4"), v2 = c("yes", "yes", "n/a", "no"))
master_list <- list(dt1 = dt1, dt2 = dt2)

The following solution works, but it is taking a prohibitively long time (~30 minutes with the full data) so I am trying to find a more efficient solution:

unknowns <- c("n/a", "not applicable", "unknown")
na_edit <- function(x){ifelse(x %in% unknowns, NA, x)}

master_list <- lapply(master_list, function(dt) {
 dt[, lapply(.SD, na_edit)]
})

> master_list$dt1
   v1   v2
1:  1  yes
2:  2   no
3:  3 <NA>
4:  4 <NA>

I have tried something resembling the following, removing the need for the ifelse, but I have not been able to make this work across multiple columns.

lapply(master_list, function(dt) {
  dt[v2 %in% unknowns, v2 := NA]
})

I feel an answer may lie in the responses in this thread. Would anyone be able to help me apply similar, or other methods, to the problem above? Many thanks in advance.

rsco
  • 68
  • 5
  • 4
    A better option would be to specify in `fread`, `na.strings` while reading (as it is a `data.table` input) or with `read.table/read.csv` instead of post-processing for conversion to missing values `NA`, It may be done by `lapply(files, fread, na.strings = c("n/a", "not applicable", "unknown"))` – akrun Apr 24 '19 at 15:55
  • 1
    @akrun Thanks. I agree it's one option. But I'm adding to the list of strings to replace, and would prefer not to need to return to the time-consuming data loading stage each time. – rsco Apr 24 '19 at 16:32
  • 1
    ok, last I checked `fread` reads it pretty fast – akrun Apr 24 '19 at 16:34
  • @rsco, is there any reason for keeping the data.tables in a list? otherwise, you can use `rbindlist(master_list)[v2 %chin% c("n/a", "not applicable", "unknown"), v2 := NA_character_][]` – chinsoon12 Apr 25 '19 at 02:29
  • @chinsoon12 I'm running quite a lot of data cleaning operations first that are specific to each data.table. And then will use `rbindlist` afterwards. Some of the cleaning relies on having recoded the variants of NA. – rsco Apr 25 '19 at 09:03
  • maybe rbindlist, recode and then data.table::split might be faster than what you are doing now – chinsoon12 Apr 25 '19 at 09:23

0 Answers0