Why are NAs in newly created data.frame when using logical selection?

Question

I'm trying to get rid of NAs in an R data.frame. I was trying to create a new df that included only rows whose cluster was "texas" in this example.

> newdf <- df[df$cluster == "texas",]
> summary(newdf$cluster)
     texas    oklahoma            NA's 
      510           0             719

I had found other questions that address getting rid of NAs, but in this case I was only selecting rows whose "cluster" column is equal to "texas" -- how did the NAs come along for the ride?

Is there a better way of doing what I want?

You can do `df[which(df$cluster == "texas"),]` – MrFlick, Dec 09 '15 at 23:39

score 3 · Accepted Answer · edited May 23 '17 at 12:24

As @MrFlick suggests above, NA values are handled in slightly (subtly?) different ways depending on how you index.

Test data:

dd <- data.frame(cluster=c("oklahoma","texas",NA))

logical indexing: a TRUE value in the index vector selects the corresponding value, FALSE drops it, and NA results in NA.

dd$cluster=="oklahoma"
## [1] TRUE FALSE NA
summary(dd[dd$cluster=="oklahoma",])
## oklahoma    texas     NA's 
##        1        0        1

(Because dd has only one column and we did not specify drop=FALSE, the indexed result above is simplified to a vector before being summarized.) In principle you could use dd$cluster=="oklahoma" & !is.na(dd$cluster) as your criterion, since FALSE & NA is FALSE, but that's rather awkward.
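The combined criterion mentioned above can be sketched as follows, using the same dd test data. The names keep and res are only illustrative, and drop = FALSE is added here so that the single-column result stays a data frame.

```r
dd <- data.frame(cluster = c("oklahoma", "texas", NA))

# The comparison yields NA for the missing value; !is.na() is FALSE there,
# and NA & FALSE evaluates to FALSE, so that row is excluded from the index
keep <- dd$cluster == "oklahoma" & !is.na(dd$cluster)
keep
## [1]  TRUE FALSE FALSE

# drop = FALSE keeps the one-column result as a data frame
res <- dd[keep, , drop = FALSE]
nrow(res)
## [1] 1
```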

subset: although it is sometimes discouraged for non-interactive use, subset has the convenient property that it drops rows where the criterion evaluates to NA. (Also, subset always returns a data frame even if the result is only one column wide.)

summary(subset(dd,cluster=="oklahoma"))
##      cluster 
##  oklahoma:1  
##  texas   :0

which: which() only returns indices for TRUE values, not for NA values:

which(dd$cluster=="oklahoma")
## [1] 1
summary(dd[which(dd$cluster=="oklahoma"),])
## oklahoma    texas 
##        1        0
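Applied to the situation in the question, the three approaches can be compared side by side. This is a minimal sketch on a made-up df (the values and the value column are hypothetical, not the asker's data):

```r
# Hypothetical stand-in for the asker's data
df <- data.frame(cluster = c("texas", NA, "oklahoma", "texas", NA),
                 value = 1:5)

by_logical <- df[df$cluster == "texas", ]         # NA in the index yields all-NA rows
by_which   <- df[which(df$cluster == "texas"), ]  # only the TRUE positions are kept
by_subset  <- subset(df, cluster == "texas")      # NA in the criterion is dropped

nrow(by_logical)
## [1] 4
nrow(by_which)
## [1] 2
nrow(by_subset)
## [1] 2
```

The two extra rows in by_logical are filled entirely with NA, which is where the NA's count in the question's summary() output comes from.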
