R - show only levels used in a subset of data frame

Question

I have a rather large data frame with a factor that has a lot of levels (more than 4,000). Another column in the same data frame serves as a reference, and I'd like to find which levels occur in the rows where this reference column is NA.

The first step I'm using is `subsetrows <- which(is.na(mydata$reference))`, but after that I'm stuck. I want something like `levels(mydata[subsetrows, "factor"])`, but unfortunately this shows me all the levels and not just the ones that occur in `subsetrows`. I suppose I could create a new vector outside my data frame containing only the subset rows and then drop any unused levels, but is there an easier or cleaner way to do this, possibly without copying my data outside the data frame?

As an example of what I want returned, if my data frame has factor levels from A to Z, but in my subset only P, R and Y appear, I want something that returns the levels P, R and Y.
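A minimal sketch of the situation described above, assuming a hypothetical data frame `mydata` whose column names follow the question (the values are invented for illustration). It shows that subsetting a factor keeps every level, which is why `levels()` alone does not answer the question.

```r
# Hypothetical data: a factor with levels A..Z and a reference column.
mydata <- data.frame(
  factor    = factor(LETTERS, levels = LETTERS),
  reference = seq_along(LETTERS)
)
# Make the reference NA wherever the factor is P, R or Y.
mydata$reference[mydata$factor %in% c("P", "R", "Y")] <- NA

subsetrows <- which(is.na(mydata$reference))

# Subsetting a factor keeps the full set of levels ...
all_levels <- levels(mydata$factor[subsetrows])
length(all_levels)  # 26, not 3

# ... even though only three distinct values are present in the subset.
present <- as.character(mydata$factor[subsetrows])
present
```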

Maybe this could help: http://stackoverflow.com/questions/1195826/dropping-factor-levels-in-a-subsetted-data-frame-in-r — NicE, Feb 17 '15 at 21:45
Could you try `levels(mydata$factor)[mydata$factor[subsetrows]]`? — Marat Talipov, Feb 17 '15 at 21:52
Well, using your previous idea of `unique`, this turns out to give me the right levels: `unique(as.character(mydata$factor[subsetrows]))` — Alium Britt, Feb 17 '15 at 22:10
In fact, `as.character.factor` is a wrapper for `levels(x)[x]` — Marat Talipov, Feb 17 '15 at 22:16
@MaratTalipov - I think using `levels` will give the wrong answer as the length of `levels(x)` doesn't match what you are using to subset with necessarily. — thelatemail, Feb 17 '15 at 22:35
@thelatemail, I don't think so: subsetting `mydata$factor[subsetrows]` returns a subset of factors, i.e. numeric indices that serve as a shortcut for characters stored in levels, that is guaranteed to be within the length of `levels(x)` and is guaranteed to match the proper level. In fact, as I noted before, `as.character(mydata$factor[subsetrows])` calls `as.character.factor`, whose definition is `function(x) levels(x)[x]`. Thus, my solution and the one proposed by Alium (which I actually like more because of its compactness) are essentially the same thing — Marat Talipov, Feb 17 '15 at 22:44
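The point debated in these comments can be checked on a small case. The sketch below uses a hypothetical factor `x` and index vector `idx` (neither is from the thread) and relies on the base R behaviour that a factor used as an index is treated as its integer codes.

```r
# Hypothetical factor with an unused level ("d") and a repeated value.
x <- factor(c("b", "a", "c", "a"), levels = c("a", "b", "c", "d"))
idx <- c(1, 2, 4)

# Indexing levels() by a factor uses its integer codes, so each element
# maps to its own label regardless of how many levels exist.
via_levels <- levels(x)[x[idx]]
via_as_character <- as.character(x[idx])

identical(via_levels, via_as_character)

# Both forms keep duplicates; unique() reduces them to the distinct labels.
unique(via_levels)
```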

score 2 · Answer 1 · answered Feb 17 '15 at 21:45

You can certainly accomplish this with base functions. But my personal preference is to use dplyr with chained operations such as this:

library(dplyr)

# Keep the rows where ref is NA, then list the distinct values of field
d %>%
  filter(is.na(ref)) %>%
  select(field) %>%
  distinct()

data

d <- data.frame(
  field = c("A", "B", "C", "A", "B", "C"),
  ref = c(NA, "a", "b", NA, "c", NA)
)

davechilders

What does that %>% operator do? – Alium Britt Feb 17 '15 at 21:50
It's a forward piping operator from [magrittr](https://github.com/smbache/magrittr). Basically `x %>% f()` is equivalent to `f(x)`. – Lincoln Mullen Feb 17 '15 at 21:51
So ... that would be the equivalent of putting the `d$ref` in the `filter` line, then putting the result in the `select` line, and then the result of that in the `distinct` line? – Alium Britt Feb 17 '15 at 21:57
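To make the piping explanation concrete, here is a sketch that writes the chain from the answer both ways, assuming dplyr is installed. The names `piped` and `nested` are introduced only for this illustration. Note that it is the whole data frame `d`, not `d$ref`, that is passed along as the first argument of each step.

```r
library(dplyr)

# Same example data as in the answer above.
d <- data.frame(
  field = c("A", "B", "C", "A", "B", "C"),
  ref   = c(NA, "a", "b", NA, "c", NA)
)

# Piped form: each result becomes the first argument of the next call.
piped <- d %>%
  filter(is.na(ref)) %>%
  select(field) %>%
  distinct()

# The same computation written as nested calls, read from the inside out.
nested <- distinct(select(filter(d, is.na(ref)), field))

identical(piped, nested)
```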

Alium Britt · Accepted Answer · 2015-02-18T07:12:22.153

I modified Marat's suggestion from the comments to use `unique`, which returns the correct levels.

Solution:

subsetrows <- which(is.na(mydata$reference))
unique(as.character(mydata$factor[subsetrows]))

While I like learning new packages and functions, this solution seems better at this point since it's more compact and easier for me to understand if I need to revisit this code at some distant point in the future.
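For comparison, a small worked sketch of this solution next to the base R `droplevels` route mentioned in the linked question. The data frame here is hypothetical and only the column names come from the question. One difference worth knowing: `unique` returns labels in order of first appearance, while `levels(droplevels(...))` returns them in level order.

```r
# Hypothetical data: 26 possible levels, only a few of them used.
mydata <- data.frame(
  factor    = factor(c("R", "A", "P", "R", "Z", "Y"), levels = LETTERS),
  reference = c(NA, 1, NA, NA, 2, NA)
)

subsetrows <- which(is.na(mydata$reference))

# Approach from this answer: distinct labels, in order of first appearance.
used <- unique(as.character(mydata$factor[subsetrows]))

# Alternative: drop unused levels from the subset, then ask for its levels.
# The result follows the factor's level order instead.
used_levels <- levels(droplevels(mydata$factor[subsetrows]))

used
used_levels
```

Either form works on the subset only and leaves `mydata` itself unchanged.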
