
Usually, when I want to subset a data frame based on certain values of a variable, I use subset and %in%:

x <- data.frame(u=1:10,v=LETTERS[1:10])
x
subset(x, v %in% c("A","D"))

I have now found that == also gives the same result:

subset(x, v == c("A","D"))

I'm wondering whether the two are identical or whether there is a reason to prefer one over the other. Thanks for your help.

Edit (@MrFlick): This question is not the same as this one, which asks how to exclude several values: (!x %in% c('a','b')). I am asking why I get the same result whether I use == or %in%.

giordano
  • you want `%in%`, `==` only works here because of a lucky coincidence between recycling rules and the length of both vectors. Consider `1:10 == c(1,3)` to convince yourself that it's a coincidence. – baptiste Nov 07 '14 at 16:08
  • You almost exclusively could use `%in%` (and `==` for special occasions) – rawr Nov 07 '14 at 16:48

1 Answer

You should use the first one, %in%. With == you got the expected result only because, in the example dataset, the rows happen to line up with the recycled values A, D. Here, == is comparing v against

rep(c("A", "D"), length.out = nrow(x))
#[1] "A" "D" "A" "D" "A" "D" "A" "D" "A" "D"

x$v == rep(c("A", "D"), length.out = nrow(x)) # matches only by coincidence
#[1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
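
The recycling can be made more visible with a small sketch. The names `v`, `wanted`, `recycled` and `eq` are illustrative and not from the question; the second comparison uses a pair of values that does not happen to line up with the recycled positions.

```r
v <- LETTERS[1:10]
wanted <- c("A", "D")

# == compares position by position, recycling the shorter vector
recycled <- rep(wanted, length.out = length(v))
eq <- v == wanted
identical(eq, v == recycled)  # the two comparisons are the same thing
which(eq)                     # positions where v and the recycled vector agree

# "C" sits at position 3, where the recycled vector holds "A",
# so == silently misses it, while %in% finds it
miss <- which(v == c("A", "C"))
hit  <- which(v %in% c("A", "C"))
miss
hit
```

Because both vectors have lengths where one divides the other, R gives no warning here, which is why the mistake is easy to overlook.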


If the order of the values is reversed, == returns no rows:

subset(x, v == c("D","A"))
#[1] u v
#<0 rows> (or 0-length row.names)

because the comparison there is

x$v == rep(c("D", "A"), length.out = nrow(x))
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

whereas %in% works regardless of the order:

subset(x, v %in% c("D","A"))
#  u v
#1 1 A
#4 4 D
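
As a rough illustration of why %in% does not depend on order: it is a membership test built on match(), so each element of v is checked against the whole set. The sketch below compares it with an explicit match() call and with a chain of single-value == comparisons; the variable names `keep`, `a`, `b` and `c_` are made up for this example.

```r
x <- data.frame(u = 1:10, v = LETTERS[1:10])
keep <- c("D", "A")

a  <- x$v %in% keep                        # membership test
b  <- match(x$v, keep, nomatch = 0L) > 0L  # what %in% does internally
c_ <- x$v == "D" | x$v == "A"              # == is safe against a single value

identical(a, b)
identical(a, c_)
x[a, ]
```

Comparing against one value at a time with == and combining with | is fine; the problem only arises when the right-hand side of == has more than one element.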
akrun