0

I have this dummy dataset:

abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"))

where I am trying to replace "NA" with standard NA; in place using data.table. I tried:

for(i in names(abc)) (abc[which(abc[[i]] == "NA"), i := NA])
for(i in names(abc)) (abc[which(abc[[i]] == "NA"), i := NA_character_])
for(i in names(abc)) (set(abc, which(abc[[i]] == "NA"), i, NA))

However still with this I get:

abc$a 
"NA" "bc" "x"

What am I missing?

EDIT: I tried @frank answer in this question which makes use of type.convert(). (Thanks frank; didn't know such obscure albeit useful function) In documentation of type.convert() it is mentioned: "This is principally a helper function for read.table." so I wanted to test this thoroughly. This function comes with small side effect when you have a complete column filled with "NA" (NA string). In such case type.convert() is converting column to logical. For such case abc will be:

abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))

EDIT2: To summerize code present in original question:

for(i in names(abc)) (set(abc, which(abc[[i]] == "NA"), i, NA))

works fine but only in current latest version of data.table (> 1.11.4). So if one is facing this problem then its better to update data.table and use this code than type.convert()

abhiieor
  • 3,132
  • 4
  • 30
  • 47
  • 1
    We've given you different ways to do things because many of us avoid loops... but it is possible to make your code work. Try `for(i in names(abc)) (abc[which(abc[[i]] == "NA"), (i) := NA])` for me and see if that works. `i := ` says "store results in column "i". `(i) := ` says "evaluate the contents of "i" as a column name and then store results there". DON'T FORGET that RStudio does NOT update the view of the datatable when you run your code. You have to close the preview and then re-open it in order to see results. This is done for speed purposes. – Adam Sampson Aug 06 '18 at 17:56
  • 1
    FYI, your `for(i in names(abc)) (set(abc, which(abc[[i]] == "NA"), i, NA))` works on my computer...You just have to refresh the view in RStudio because datatable updates using `:=` do not refresh the view automatically. – Adam Sampson Aug 06 '18 at 17:58

3 Answers3

5

I'd do...

chcols = names(abc)[sapply(abc, is.character)]
abc[, (chcols) := lapply(.SD, type.convert, as.is=TRUE), .SDcols=chcols]

which yields

> str(abc)
Classes ‘data.table’ and 'data.frame':  3 obs. of  3 variables:
 $ a: chr  NA "bc" "x"
 $ b: num  1 2 3
 $ c: chr  "n" NA NA
 - attr(*, ".internal.selfref")=<externalptr> 

Your DT[, i :=] code did not work because it creates a column literally named "i"; and your set code does work already, as @AdamSampson pointed out. (Note: OP upgraded from data.table 1.10.4-3 to 1.11.4 before this was the case on their comp.)


so I wanted to test this thoroughly. This function comes with small side effect when you have a complete column filled with "NA" (NA string). In such case type.convert() is converting column to logical.

Oh right. Your original approach is safer against this problem:

# op's new example
abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))

# op's original code
for(i in names(abc)) 
  set(abc, which(abc[[i]] == "NA"), i, NA)

Side note: NA has type logical; and usually data.table would warn when assigning values of an incongruent type to a column, but I guess they wrote in an exception for NAs:

DT = data.table(x = 1:2)
DT[1, x := NA]
# no problem, even though x is int and NA is logi

DT = data.table(x = 1:2)
DT[1, x := TRUE]
# Warning message:
# In `[.data.table`(DT, 1, `:=`(x, TRUE)) :
#   Coerced 'logical' RHS to 'integer' to match the column's type. Either change the target column ['x'] to 'logical' first (by creating a new 'logical' vector length 2 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
Frank
  • 66,179
  • 8
  • 96
  • 180
  • Pardon but `set()` approach doesn't work in my computer with data.table 1.10.4.3. While I print `abc$a` after set command I still get `"NA" "bc" "x"` – abhiieor Aug 06 '18 at 18:24
  • 1
    @abhiieor Hm, not sure why that would happen. (I'm running data.table version 1.11.5 from github on r 3.3.3) You could try upgrading to the latest CRAN version, 1.11.4 in case it was a short-term bug. I see it working in both RStudio (1.0.44) and R console – Frank Aug 06 '18 at 18:30
  • 2
    Indeed solved in 1.11.4. Should have updated data.table. – abhiieor Aug 06 '18 at 18:43
1

I really liked Frank's response, but want to add to it because it assumes you're only performing the change for character vectors. I'm also going to try to include some info on "why" it works.

To replace all NA you could do something like:

chcols = names(abc)
abc[,(chcols) := lapply(.SD, function(x) ifelse(x == "NA",NA,x)),.SDcols = chcols]

Let's breakdown what we are doing here.

We are looking at every row in abc (because there is nothing before the first comma).

After the next comma is the columns. Let's break that down.

We are putting the results into all of the columns listed in chcols. The (chcols) tells the data.table method to evaluate the vector of names held in the chcols object. If you left off the parentheses and used chcols it would try to store the results in a column called chcols instead of using the column names you want.

.SD is returning a data.table with the results of every column listed in .SDcols (in my case it is returning all columns...). But we want to evaluate a single column at a time. So we use lapply to apply a function to every column in .SD one at a time.

You can use any function that will return the correct values. Frank used type.convert. I'm using an anonymous function that evaluates an ifelse statement. I used ifelse because it evaluates and returns an entire vector/column.

You already know how to use a := to replace values in place.

After the next column you either put the by information or you put additional options. We will add additional options in the form of .SDcols.

We need to put a .SDcols = chcols to tell data.table which columns to include in .SD. My code is evaluating all columns, so if you left off .SDcols my code would still work. But it's a bad habit to leave this column off because you can lose time in the future if you make a change to only evaluate certain columns. Frank's example only evaluated columns that were of the character class for instance.

Adam Sampson
  • 1,971
  • 1
  • 7
  • 15
  • 1
    As another example of a function to put there, you can also use `replace(x, x=="NA", NA)`; in my experience, ifelse is messier to work with. Not sure I get the point about wanting to include all columns. There is no such thing as "NA" except in character columns, right? – Frank Aug 06 '18 at 18:21
  • 1
    Sweet. I always forget replace. I need to put a "replace" jar next to my swear jar for every time I forget replace. – Adam Sampson Aug 06 '18 at 19:31
  • As for NA in non-character columns, I'm not sure about in his data. NA just means "Not Available" or missing. I decided to err on the side of caution and assume there could be missing data in other types of data in his set. – Adam Sampson Aug 06 '18 at 19:33
0

Here are two other approaches:

Subsetting

library(data.table)
abcd <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3),
                   c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))

for (col in names(abcd)) abcd[get(col) == "NA", (col) := NA]
abcd[]
      a b    c    d
1: <NA> 1    n <NA>
2:   bc 2 <NA> <NA>
3:    x 3 <NA> <NA>

Update while joining

Here, data.table is rather strict concerning variable type.

abcd <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3),
                   c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))

for (col in names(abcd)) 
  if (is.character(abcd[[col]])) 
    abcd[.("NA", NA_character_), on = paste0(col, "==V1"), (col) := V2][]
abcd
      a b    c    d
1: <NA> 1    n <NA>
2:   bc 2 <NA> <NA>
3:    x 3 <NA> <NA>
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134