
I learned this approach to calculating values for a new column conditional on the values in an existing column. I picked this one up, along with some other incredibly handy tips, from an earlier post: What is the most useful R trick?

mydf <- expand.grid(var1 = c('type1', 'type2', 'type3'), var2 = c(1, 2, 3))
mydf$var3 <- rnorm(dim(mydf)[1], mean=90, sd=10)
mydf$column2[mydf$var3 > 90] <- "big" #now my conditional replacement

Works great, but there was a worrisome comment: "There is a small trap here [etc...]. If df$column1 contains NA values, subsetting using == will pull out any values that equal x and any NAs. To avoid this, use %in% instead of ==." Another comment suggested avoiding this with na.omit. However, I did not observe this behavior:

mydf <- expand.grid(var1 = c('type1', 'type2', 'type3'), var2 = c(1, 2, 3))
mydf$var3 <- rnorm(dim(mydf)[1], mean=90, sd=10)
mydf$var3[3] <- 90
mydf$var3[4] <- NA
is.na(mydf$var3[4])  # True!
mydf$column4[mydf$var3 == 90] <- "exactly 90!"  # possible unintended behavior w/ row 4?
mydf$column4[mydf$var3 > 90] <- "big"
mydf  # if there is a trap shouldn't mydf$column4[4] == "exactly 90!" ?

Of course I am interested in coding correctly and avoiding any possible mistake, but I could not figure out how to use na.omit to explicitly assign NA to rows where there is an NA in var3, in the same way we did for the other logical conditions, like var3 == 90. Questions: a) why did I not see the unintended matching that we were warned about, b) how would I code to explicitly avoid this using is.na, and c) are there any other unexpected behaviors to be aware of with this approach?

marcel

3 Answers


I'm not exactly clear on what you're asking. If you could provide an example of how the fourth column should look afterwards, that would definitely help.

But, I think na.pass() might work for you here. na.omit() removes all rows that contain at least one NA, and it doesn't seem like you need that here.

> np <- na.pass(mydf$var3)
> np
#[1] 106.17409  88.48014  90.00000        NA  91.62274  91.75860
#[7]  85.91689  91.06369 100.20514
> mydf$var4 <- ifelse(np > 90, "big", ifelse(np == 90, "exact", ""))
> mydf$var4
#[1] "big"   ""      "exact" NA      "big"   "big"   ""      "big"   "big"
Rich Scriven

Yes and no. The trap is that when you subset a data.frame with ==, rows corresponding to NAs also get returned. But you aren't quite doing that here: mydf$var3 == 90 returns a logical vector, not a subset of the data frame, and only the TRUE positions get replaced with "exactly 90!"; the FALSE and NA positions are left alone.

mydf$var3 == 90
[1] FALSE FALSE  TRUE    NA FALSE FALSE FALSE FALSE FALSE
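
For comparison, here is a minimal sketch of the case where the trap does bite, assuming the mydf built in the question (90 in row 3, NA in row 4): subsetting the data frame itself with == drags along an all-NA row, while %in% does not, because %in% maps NA to FALSE.

mydf[mydf$var3 == 90, ]   # the row equal to 90 plus an all-NA row for the NA in var3
mydf[mydf$var3 %in% 90, ] # only the row equal to 90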
JeremyS
  • Great, thanks for the explanation. Reassuring that R is behaving as I would expect in this case. – marcel Sep 01 '14 at 09:10

Maybe this helps. You could use cut with very narrow breaks around 90 (if that is allowed):

  mydf$var4 <- with(mydf, as.character(cut(var3,
        breaks = c(-Inf, 89.999999, 90.0001, Inf),
        labels = c("", "exactly 90!", "big"))))

  mydf
  #   var1 var2      var3        var4
  #1 type1    1 103.34752         big
  #2 type2    1  88.58128            
  #3 type3    1  90.00000 exactly 90!
  #4 type1    2        NA        <NA>
  #5 type2    2  72.37580            
  #6 type3    2  83.34518            
  #7 type1    3  96.28078         big
  #8 type2    3  88.91577            
  #9 type3    3  78.68584            
akrun