I learned this approach for calculating values in a new column conditional on the values in an existing column. I picked it up, along with some other incredibly handy tips, from an earlier post: What is the most useful R trick?
mydf <- expand.grid(var1 = c('type1', 'type2', 'type3'), var2 = c(1, 2, 3))
mydf$var3 <- rnorm(nrow(mydf), mean = 90, sd = 10)
mydf$column2[mydf$var3 > 90] <- "big" # now my conditional replacement
Works great, but there was a worrisome comment: "There is a small trap here [etc...]. If df$column1 contains NA values, subsetting using == will pull out any values that equal x and any NAs. To avoid this, use '%in%' instead of '=='." Another comment suggested avoiding the problem with na.omit. However, I did not observe this behavior:
mydf <- expand.grid(var1 = c('type1', 'type2', 'type3'), var2 = c(1, 2, 3))
mydf$var3 <- rnorm(nrow(mydf), mean = 90, sd = 10)
mydf$var3[3] <- 90
mydf$var3[4] <- NA
is.na(mydf$var3[4]) # True!
mydf$column4[mydf$var3 == 90] <- "exactly 90!" # possible unintended behavior w/ row 4?
mydf$column4[mydf$var3 > 90] <- "big"
mydf # if there is a trap, shouldn't mydf$column4[4] be "exactly 90!"?
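For comparison, the trap the comment describes does show up when I use == for extraction rather than for assignment. This is my own minimal sketch, reusing the same toy data:

```r
mydf <- expand.grid(var1 = c('type1', 'type2', 'type3'), var2 = c(1, 2, 3))
mydf$var3 <- rnorm(nrow(mydf), mean = 90, sd = 10)
mydf$var3[3] <- 90
mydf$var3[4] <- NA

# Extraction with ==: the NA in var3 produces an NA in the logical
# index, and subsetting a data frame with an NA index returns an
# extra all-NA row alongside the genuine match.
mydf[mydf$var3 == 90, ]   # row 3 plus an all-NA row from row 4

# Extraction with %in%: never returns NA, so only the real match.
mydf[mydf$var3 %in% 90, ] # row 3 only
```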
Of course I am interested in coding correctly and avoiding any possible mistake, but I could not figure out how to use na.omit to explicitly assign NA to rows where var3 is NA, in the same way we did for the other logical conditions, like var3 == 90. Questions:

a) Why did I not see the unintended matching that we were warned about?

b) How would I code to explicitly avoid this using is.na?

c) Are there any other unexpected behaviors to be aware of with this approach?
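To make the kind of explicit handling I am after concrete, here is my own sketch using an is.na guard (since I could not get na.omit to do this); the guard is my assumption about what "explicitly avoid" should look like, not something from the original post:

```r
mydf <- expand.grid(var1 = c('type1', 'type2', 'type3'), var2 = c(1, 2, 3))
mydf$var3 <- rnorm(nrow(mydf), mean = 90, sd = 10)
mydf$var3[3] <- 90
mydf$var3[4] <- NA

# Guard each comparison with !is.na() so the NA row can never match,
# and start the new column as NA so missing inputs stay explicitly NA.
mydf$column4 <- NA
mydf$column4[!is.na(mydf$var3) & mydf$var3 == 90] <- "exactly 90!"
mydf$column4[!is.na(mydf$var3) & mydf$var3 > 90]  <- "big"
```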