
I'm trying to get rid of NAs in an R data.frame. I was trying to create a new df that included only rows whose cluster was "texas" in this example.

> newdf <- df[df$cluster == "texas",]
> summary(newdf$cluster)
     texas    oklahoma            NA's 
      510           0             719 

I had found other questions that address getting rid of NAs, but in this case I was only selecting rows whose "cluster" column is equal to "texas" -- how did the NAs come along for the ride?

Is there a better way of doing what I want?

JBWhitmore

1 Answer


As @MrFlick suggests above, NA values are handled in slightly (subtly?) different ways depending on how you index.

Test data:

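dd <- data.frame(cluster=c("oklahoma","texas",NA), stringsAsFactors=TRUE)  # factor column (the default before R 4.0), needed for the summaries shown here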
  1. logical indexing: a TRUE value in the index vector selects the corresponding value, FALSE drops it, and NA results in NA.
dd$cluster=="oklahoma"
## [1] TRUE FALSE NA
summary(dd[dd$cluster=="oklahoma",])
## oklahoma    texas     NA's 
##        1        0        1 

In principle you could use dd$cluster=="oklahoma" & !is.na(dd$cluster) as your criterion, since NA & FALSE is FALSE, but that's rather awkward. (Because we indexed a single-column data frame without specifying drop=FALSE, the result is simplified to a vector before being summarized.)
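
A minimal sketch of that combined criterion, with the test data re-created so the snippet runs on its own. The names keep and res_logical are mine, and stringsAsFactors=TRUE and drop=FALSE are my additions (to keep cluster a factor in current R and to keep the result a data frame):

```r
# Test data; stringsAsFactors=TRUE is an assumption so that cluster is a
# factor in R >= 4.0 as well.
dd <- data.frame(cluster=c("oklahoma","texas",NA), stringsAsFactors=TRUE)

# The comparison alone gives TRUE, FALSE, NA; and-ing with !is.na() turns
# the NA into FALSE, because NA & FALSE is FALSE.
keep <- dd$cluster=="oklahoma" & !is.na(dd$cluster)
keep

# drop=FALSE keeps the result a one-column data frame instead of a vector.
res_logical <- dd[keep, , drop=FALSE]
nrow(res_logical)
```

Only the "oklahoma" row should survive; the NA row is dropped rather than carried along.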

  2. subset: although it is sometimes discouraged for non-interactive use, subset has the convenient property that it drops rows where the criterion evaluates to NA. (Also, subset always returns a data frame, even if the result is only one column wide.)
summary(subset(dd,cluster=="oklahoma"))
##      cluster 
##  oklahoma:1  
##  texas   :0  
  3. which:

which() only returns indices for TRUE values, not for NA values:

which(dd$cluster=="oklahoma")
## [1] 1
summary(dd[which(dd$cluster=="oklahoma"),])
## oklahoma    texas 
##        1        0 
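
Applied to the case in the question, the same ideas might look like the sketch below. The asker's real df is not shown, so the data frame here (including the value column) is a hypothetical stand-in:

```r
# Hypothetical stand-in for the asker's df; the real data is not shown.
df <- data.frame(cluster=c("texas", NA, "oklahoma", "texas", NA),
                 value=1:5, stringsAsFactors=TRUE)

# Plain logical indexing: each NA in the index produces an all-NA row.
with_na <- df[df$cluster=="texas", ]
nrow(with_na)                 # 4 rows, two of them all NA

# which() and subset() both keep only rows where the test is TRUE.
by_which  <- df[which(df$cluster=="texas"), ]
by_subset <- subset(df, cluster=="texas")
by_which$value
by_subset$value
```

Note that the unused "oklahoma" level is still attached to the factor afterwards, which is why the summaries above report a count of 0 for it; droplevels() can remove it if that is unwanted.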
Ben Bolker