Let's consider the following variable:

 y <- factor(5:1, levels = 1:5, labels = c(1:4, NA))

What's the best way to select all values that don't have label NA?

> !is.na(y)
[1] TRUE TRUE TRUE TRUE TRUE

One could compare the underlying integer codes, but it is cumbersome to look up which code belongs to the NA level:

> as.integer(y)
[1] 5 4 3 2 1
> as.integer(y) == which(is.na(levels(y)))
[1]  TRUE FALSE FALSE FALSE FALSE

Conversion to character seems to work but this seems computationally suboptimal:

> as.character(y)
[1] NA  "4" "3" "2" "1"
> is.na(as.character(y))
[1]  TRUE FALSE FALSE FALSE FALSE

Any other (efficient) ideas that are easy to handle?

NOTE: This question is specifically about dealing with NA as a factor level, not about selecting values in general. As it turns out, though, the most general methods handle this as expected; see the comments below.

NOTE2: The use of y[!y %in% NA] seems to work because of the way %in% works. From the docs: "Factors, raw vectors and lists are converted to character vectors." I.e. the use of %in% actually is equivalent to the as.character-based approach above. This conversion should be avoided though -- which is the problem posed in this question.

BTW here is a minor microbenchmark of the approaches described above:

library(microbenchmark)
y <- factor(rep(5:1, 1000000), levels = 1:5,
            labels = c("foo", "foo bar", "foobar bar", "foo foobar", NA))
microbenchmark(
    as.integer(y) == which(is.na(levels(y))),
    is.na(as.character(y)),
    y %in% NA,
    is.na(levels(y)[y]),
    ## times = 1e5,
    times = 100,
    ## make sure all four expressions return identical results
    check = function(values) {
        all(sapply(values[-1], function(x) identical(values[[1]], x)))
    }
)

Unit: milliseconds
                                     expr       min   median       max
 as.integer(y) == which(is.na(levels(y)))  8.566085 15.92278  46.45769
                   is.na(as.character(y)) 24.554066 29.05405  58.22167
                                y %in% NA 58.836131 64.57089 104.53393
                      is.na(levels(y)[y]) 29.748583 34.27200 131.22975

So the best option is probably to wrap the first approach in a function. The differences aren't big, though. Unfortunately, microbenchmark() doesn't report memory use.
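Such a wrapper might look like the sketch below. The name `is_na_level` is hypothetical (it is not part of base R). It assumes that a factor has at most one NA level, since levels must be unique, and it also flags elements whose integer code itself is missing, which the bare `==` comparison would not do.

```r
# Hypothetical helper: TRUE where an element of factor `f` is NA, either
# because its level label is NA or because its integer code is missing.
is_na_level <- function(f) {
  stopifnot(is.factor(f))
  codes <- as.integer(f)               # integer codes, no character conversion
  na_code <- which(is.na(levels(f)))   # position of the NA level; integer(0) if none
  if (length(na_code) == 0L) {
    return(is.na(codes))               # no NA level: only missing codes count
  }
  is.na(codes) | codes == na_code
}

y <- factor(5:1, levels = 1:5, labels = c(1:4, NA))
y[!is_na_level(y)]   # keeps the elements labelled "4", "3", "2", "1"
```

The explicit branch for factors without an NA level matters because `codes == integer(0)` would otherwise return a zero-length vector instead of one value per element.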

Regards Tom

lith
  • `y[!y %in% NA]` ? – Ronak Shah Nov 22 '18 at 08:23
  • I can't find a cleaner way to do this. May we ask why you are working with factor _labels_ in this way? – Tim Biegeleisen Nov 22 '18 at 08:27
  • @RonakShah Interesting approach. I wasn't aware that `%in%` handles this situation that way. – lith Nov 22 '18 at 08:31
  • `y[!is.na(levels(y))]` – Andre Elrico Nov 22 '18 at 08:44
  • @RonakShah I wouldn't consider this question a duplicate of the question referenced above though because this question is specifically about dealing with NA as factor levels not about selecting values in general. – lith Nov 22 '18 at 09:11
  • @lith if you look at the linked question, `data$Code` is `factor` which is `y` in your case and `selected` is the value which we want to select which is `NA` in your case. Moreover, top 2 answer from the linked duplicate works in your case. If you disagree, you can vote to reopen the question. – Ronak Shah Nov 22 '18 at 09:18
  • @RonakShah Well, but this is not the problem described here. I'll add a second note on why your approach unfortunately isn't a solution either. – lith Nov 22 '18 at 09:36
  • @lith Why is Andre's approach not an acceptable solution? At a minimum, you will need to inspect the levels of the vector to find the NA. – Chris Nov 22 '18 at 14:24
  • @Chris Because it doesn't work. It seemingly works in this particular example but this is an artefact. Run the code with `y <- factor(5:1, levels = 1:5, labels <- c(1:4, NA))` to check. I'll amend the problem description accordingly. – lith Nov 22 '18 at 16:25
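
To illustrate the point about `y[!is.na(levels(y))]`: that expression builds a logical index with one entry per level, not one per element, so it only gives the right answer when the elements happen to be in level order. A small sketch using the factor from the question (the names `keep_by_level` and `keep_by_value` are made up for this illustration):

```r
y <- factor(5:1, levels = 1:5, labels = c(1:4, NA))

keep_by_level <- !is.na(levels(y))      # one entry per level
keep_by_value <- !is.na(levels(y)[y])   # one entry per element

as.character(y[keep_by_level])  # NA "4" "3" "2": keeps the NA element, drops "1"
as.character(y[keep_by_value])  # "4" "3" "2" "1"
```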
