
Beforehand

The most obvious answer to the title is that missing values are represented by NA in R. Dummy data:

x <- c("a", "NA", "<NA>", NA)

We can convert all elements of x to character strings with x_paste0 <- paste0(x). After doing so, the second and fourth elements are identical (both "NA"), and to my knowledge this is why x_paste0 cannot be transformed back to x.
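A minimal sketch of that information loss, using only the names from above: after paste0() the missing value has become the ordinary string "NA", so is.na() can no longer tell the second and fourth elements apart.

```r
x <- c("a", "NA", "<NA>", NA)
x_paste0 <- paste0(x)

# Before: only the fourth element is a missing value
is.na(x)
# [1] FALSE FALSE FALSE  TRUE

# After: the missing value was converted to the string "NA"
is.na(x_paste0)
# [1] FALSE FALSE FALSE FALSE

# Elements 2 and 4 are now indistinguishable
identical(x_paste0[2], x_paste0[4])
# [1] TRUE
```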

addNA

But working with addNA suggests that NA is not the only way a missing value is represented. In x only the last element is missing. Let's transform the vector:

x_new <- addNA(x)
x_new
[1] a    NA   <NA> <NA>
Levels: <NA> a NA <NA>

Interestingly, the fourth element, i.e. the missing value, is shown as <NA> and not as NA. Further, the fourth element now looks the same as the third. And we are told that there are no missing values, because any(is.na(x_new)) returns FALSE. At this point I would have thought that the information about which element is missing (the third or the fourth) is simply lost, as it was in x_paste0. But this is not true, because we can actually transform x_new back:

as.character(x_new)
[1] "a"    "NA"   "<NA>" NA

How does as.character know that the third element is "<NA>" and the fourth is an actual missing value, i.e. NA?

2 Answers


That is probably a quirk of the base:::print.factor() method, which prints the NA level as <NA>, exactly like the literal string "<NA>".

x <- c("a", "NA", "<NA>", NA)

addNA(x)
# [1] a    NA   <NA> <NA>
# Levels: <NA> a NA <NA>

But:

levels(addNA(x))
# [1] "<NA>" "a"    "NA"   NA    

So, there are no duplicated levels.
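One way to see how as.character() recovers the missing value is to look at how the factor is stored. A minimal sketch, assuming the x from above: a factor is an integer vector of codes plus a levels attribute, and the third and fourth elements carry different codes even though they print identically. The name codes is only introduced here for illustration.

```r
x <- c("a", "NA", "<NA>", NA)
x_new <- addNA(x)

# Integer codes stored in the factor; none of them is NA
codes <- as.integer(x_new)
anyNA(codes)
# [1] FALSE

# Elements 3 and 4 point at different levels
codes[3] != codes[4]
# [1] TRUE

# Element 4 points at the level that is a real NA
is.na(levels(x_new)[codes[4]])
# [1] TRUE

# Looking up each code in the levels matches as.character()
identical(levels(x_new)[codes], as.character(x_new))
# [1] TRUE
```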

jay.sf
  • So there are no missing values in x_new (see `any(is.na(x_new))`), but there is one among the levels of x_new! Thus, we can simply drop the level that stands for the missing value, i.e. `factor(x_new, levels = levels(x_new)[-which(is.na(levels(x_new)))])`. Thanks! –  Dec 24 '21 at 20:09

Usually you try to prevent this when you read your data, whether from a CSV or another source. A somewhat silly demo using read.table on your sample vector:

x <- c("a", "NA", "<NA>", NA)
x <- read.table(text = x, na.strings = c("NA", "<NA>", ""), stringsAsFactors = FALSE)$V1
x
[1] "a" NA  NA  NA 

But if you want to fix it afterwards:

x <- c("a", "NA", "<NA>", NA)
na_strings <- c("NA", "<NA>", "")

unlist(lapply(x, function(v) { ifelse(v %in% na_strings, NA, v) }))

[1] "a" NA  NA  NA 
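The same recoding can also be written without the element-wise lapply(), since %in% is vectorised. A sketch only: recode_na_strings is a hypothetical helper name, not a base R function, and its default na_strings simply repeats the values used above.

```r
# Hypothetical helper: turn placeholder strings into real missing values
recode_na_strings <- function(v, na_strings = c("NA", "<NA>", "")) {
  v[v %in% na_strings] <- NA
  v
}

x <- c("a", "NA", "<NA>", NA)
x_clean <- recode_na_strings(x)
x_clean
# [1] "a" NA  NA  NA

is.na(x_clean)
# [1] FALSE  TRUE  TRUE  TRUE
```

Elements that are already NA are left untouched, because NA %in% na_strings is FALSE.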

Some notes on factors and addNA

# a vector without strings that merely look like missing values
x <- c("a", "b", "c", NA)

x_1 <- addNA(x)
x_1

# do not be misled by how the output is displayed
# [1] a    b    c    <NA>
# Levels: a b c <NA>
  
str(x_1)
# Factor w/ 4 levels "a","b","c",NA: 1 2 3 4

is.na(x_1) # all FALSE, because the stored values are the integer codes 1, 2, 3, 4
# [1] FALSE FALSE FALSE FALSE

is.na(levels(x_1))
# [1] FALSE FALSE FALSE  TRUE

# but nothing is lost
x_2 <- as.character(x_1)

str(x_2)
# chr [1:4] "a" "b" "c" NA

is.na(x_2)
# [1] FALSE FALSE FALSE  TRUE
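With more than one missing value, the positions can be located either from the stored codes or after converting back to character. A minimal sketch, assuming a made-up vector with two NAs; the names f and na_code are only used here for illustration.

```r
f <- addNA(c("a", "b", NA, "c", NA))

# is.na() on the factor itself finds nothing, because every code is valid
which(is.na(f))
# integer(0)

# Find the code of the NA level, then the positions that carry that code
na_code <- which(is.na(levels(f)))
which(as.integer(f) == na_code)
# [1] 3 5

# Same positions as converting back to character first
which(is.na(as.character(f)))
# [1] 3 5
```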
Merijn van Tilborg
  • Well, you provide a sample with just one missing value, simple as that. The rest are strings that you yourself interpret as something else, in this case "missing". Often we do want to convert those to NA as well. Perhaps I misunderstood your question. – Merijn van Tilborg Dec 24 '21 at 16:27
  • Any function knows that the last value in your vector is `NA`, and because the vector contains characters, it is an `NA` of type character: `typeof(NA)` is `"logical"`, `typeof(c("NA", NA)[2])` is `"character"`, `typeof(c(1L, NA)[2])` is `"integer"`, and `typeof(as.character(c(1L, NA))[2])` is `"character"`. This probably does not answer your question either, but perhaps it helps your understanding. – Merijn van Tilborg Dec 24 '21 at 16:39
  • Sure it does; you probably mean `any(is.na(levels(x_new)))`, which returns `TRUE`. – Merijn van Tilborg Dec 24 '21 at 17:44
  • I think I understand your question a bit better regarding this. Let me update my answer with this part. – Merijn van Tilborg Dec 24 '21 at 18:13
  • What is the purpose of x_new? With `x <- c("a", "b", "c", NA, "e", NA)`, `which(is.na(x))` already returns `[1] 4 6`. – Merijn van Tilborg Dec 24 '21 at 18:40
  • I understand that, but why start from the vector x, where `which(is.na(x))` gives you the answer directly, and create x_new first? You can still do so: drop the factor and get the index with `which(is.na(as.character(x_new)))`. – Merijn van Tilborg Dec 24 '21 at 18:46
  • That is only the printed output; look at how it is stored. Factors are stored as integer codes. Look at my example: `Factor w/ 4 levels "a","b","c",NA: 1 2 4 3 4`. There are 4 factor levels, and 4 (our NA level) is stored at positions 3 and 5. Running `which(is.na(as.character(addNA(c("a", "b", NA, "c", NA)))))` returns those positions: `[1] 3 5`. – Merijn van Tilborg Dec 24 '21 at 18:58
  • My comments contain an example with more NAs. Your example is confusing because it has only one NA plus a few strings with "NA" in their text, and those will never be treated as NA values, whatever their type. But I am out of the discussion; look at how factors are stored. It is advisable to use str() to see the factor and its levels. as.character(), like any output function, matches the integer codes back to the associated levels. And for R, the "NA" in your example is as much a string as my "a". – Merijn van Tilborg Dec 24 '21 at 19:16
  • paste0(x) means you have to think twice before using it, because there is no way back, as you mentioned yourself :) I skipped over that part too easily, I guess, while trying to convince you that attempts to undo it will fail. Why not check whether it is already a vector before pasting, rather than lose the distinction between a string "NA" and a true NA? It is the same problem you create with as.numeric(c("1", 2)): afterwards you can never recover which element was a string and which was not. – Merijn van Tilborg Dec 24 '21 at 19:33