Let's consider the following variable:
y <- factor(5:1, levels = 1:5, labels = c(1:4, NA))
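What's the best way to select all values that don't have the label NA? The obvious attempt doesn't help, because is.na() tests the underlying integer codes rather than the labels:
> !is.na(y)
[1] TRUE TRUE TRUE TRUE TRUE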
One could compare the underlying integer codes, but it is cumbersome to look up which code belongs to the NA level:
> as.integer(y)
[1] 5 4 3 2 1
> as.integer(y) == which(is.na(levels(y)))
[1]  TRUE FALSE FALSE FALSE FALSE
Conversion to character seems to work but this seems computationally suboptimal:
> as.character(y)
[1] NA  "4" "3" "2" "1"
> is.na(as.character(y))
[1]  TRUE FALSE FALSE FALSE FALSE
Any other (efficient) ideas that are easy to handle?
NOTE: This question is specifically about dealing with NAs as factor levels, not about selecting values in general. As it turns out, though, the most general method handles this case as expected -- see the comments below.
NOTE2: The use of y[!y %in% NA] seems to work because of the way %in% works. From the docs: "Factors, raw vectors and lists are converted to character vectors." I.e. the use of %in% is actually equivalent to the as.character-based approach above. This conversion should be avoided though -- which is the problem posed in this question.
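The equivalence can be checked directly. The sketch below also shows a side effect of the character conversion, using a hypothetical second factor f that is not part of the question: a genuinely missing value is matched as well, even though f has no NA level.

```r
y <- factor(5:1, levels = 1:5, labels = c(1:4, NA))

# %in% converts the factor to character, so both expressions agree
identical(y %in% NA, is.na(as.character(y)))

# f is a hypothetical example: one missing value, but no NA level
f <- factor(c("a", NA, "b"))
anyNA(levels(f))   # FALSE: "a" and "b" are the only levels
f %in% NA          # the missing value is still matched
```

So the character-based approaches cannot tell "has the NA level" apart from "is missing", which may or may not matter for a given data set.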
BTW here is a minor microbenchmark of the approaches described above:
library(microbenchmark)
y <- factor(rep(5:1, 1000000), levels = 1:5,
            labels = c("foo", "foo bar", "foobar bar", "foo foobar", NA))
microbenchmark(
  as.integer(y) == which(is.na(levels(y))),
  is.na(as.character(y)),
  y %in% NA,
  is.na(levels(y)[y]),
  ## times = 1e5,
  times = 100,
  ## make sure all approaches return identical results
  check = function (values) {
    all(sapply(values[-1], function(x) identical(values[[1]], x)))
  }
)
Unit: milliseconds
                                     expr       min   median       max
 as.integer(y) == which(is.na(levels(y)))  8.566085 15.92278  46.45769
                   is.na(as.character(y)) 24.554066 29.05405  58.22167
                                y %in% NA 58.836131 64.57089 104.53393
                      is.na(levels(y)[y]) 29.748583 34.27200 131.22975
So the best option is probably to wrap the first approach in a function, although the differences aren't large. Unfortunately, microbenchmark() doesn't report any information on memory use.
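A minimal sketch of such a wrapper, assuming the hypothetical name is_na_level() (it is not part of base R). It compares integer codes as in the first approach, and additionally guards two cases the one-liner does not handle: a factor without an NA level, where which() returns integer(0) and the comparison yields logical(0), and genuinely missing values, whose code is NA.

```r
# Hypothetical helper: TRUE where a factor value carries the NA *level*.
# Missing values (NA codes) are deliberately reported as FALSE.
is_na_level <- function(f) {
  stopifnot(is.factor(f))
  na_code <- match(NA, levels(f))     # position of the NA level, if any
  if (is.na(na_code)) {
    return(logical(length(f)))        # no NA level: all FALSE
  }
  codes <- as.integer(f)              # integer codes, no character conversion
  !is.na(codes) & codes == na_code
}

y <- factor(5:1, levels = 1:5, labels = c(1:4, NA))
y[!is_na_level(y)]                    # values whose label is not NA
```

Whether missing values should count as well is a design choice; replacing the last line of the function with is.na(codes) | codes == na_code would include them.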
Regards Tom