R (version 3.3.3) is giving me some unexpected behavior when subsetting a data frame on a condition based on a character column. Here is an example:

foo <- data.frame(bar = c('a',NA,'b','a'),
                  baz = 1:4,
                  stringsAsFactors = FALSE)

foo looks like this:

   bar baz
1    a   1
2 <NA>   2
3    b   3
4    a   4

I want to get all rows of this data frame where bar != "a", so I call:

foo[foo$bar != 'a', ]

This returns:

    bar baz
NA <NA>  NA
3     b   3

I do not understand why the first entry in the second column is NA and not 2. Please help me explain this strange behavior.

qdread

1 Answer

While I'm still trying to understand the behaviour, a more robust way to filter on a character column in R is to use the %in% operator, which never returns NA:

foo <- data.frame(bar = c('a',NA,'b','a'),
                  baz = 1:4,
                  stringsAsFactors = FALSE)

foo[!(foo$bar %in% 'a'), ]

Output:

> foo[!(foo$bar %in% 'a'), ]
   bar baz
2 <NA>   2
3    b   3
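
To see why the two filters differ, it may help to compare the logical vectors they produce. This is a minimal sketch that reuses the foo defined above; the names neq and nin are only illustrative.

```r
foo <- data.frame(bar = c('a', NA, 'b', 'a'),
                  baz = 1:4,
                  stringsAsFactors = FALSE)

# != propagates NA: the comparison is "unknown" where bar is NA
neq <- foo$bar != 'a'
print(neq)   # FALSE NA TRUE FALSE

# %in% is built on match(), so every element is either TRUE or FALSE
nin <- !(foo$bar %in% 'a')
print(nin)   # FALSE TRUE TRUE FALSE
```

Only the second vector is free of NA, which is why it selects rows 2 and 3 cleanly.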

Update:

The behaviour isn't caused by the character filter. It happens because the logical index contains NA: foo$bar != 'a' evaluates to NA for the row where bar is NA, so the index is effectively c(F,NA,T,F).

> foo[c(F,NA,T,F),]
    bar baz
NA <NA>  NA
3     b   3

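An NA in a logical index neither keeps nor drops the row at that position; it returns a row in which every value (including the row name) is NA. A logical index shorter than the number of rows is recycled, so the same thing happens here: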

> foo[NA,]
      bar baz
NA   <NA>  NA
NA.1 <NA>  NA
NA.2 <NA>  NA
NA.3 <NA>  NA
> foo[c(T,NA),]
      bar baz
1       a   1
NA   <NA>  NA
3       b   3
NA.1 <NA>  NA
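
If you would rather keep the != comparison, the NA in the index has to be handled explicitly. The sketch below shows three common options under the assumption that foo is defined as in the question; r1, r2 and r3 are just illustrative names. Note that the first two drop the row where bar is NA, while the third keeps it.

```r
foo <- data.frame(bar = c('a', NA, 'b', 'a'),
                  baz = 1:4,
                  stringsAsFactors = FALSE)

# which() returns only the positions where the condition is TRUE,
# so the NA position is silently dropped
r1 <- foo[which(foo$bar != 'a'), ]

# subset() treats NA in the condition as FALSE
r2 <- subset(foo, bar != 'a')

# Keep rows where bar is NA by testing for it explicitly
# (TRUE | NA evaluates to TRUE)
r3 <- foo[is.na(foo$bar) | foo$bar != 'a', ]

print(r1)
print(r2)
print(r3)
```

Which variant is appropriate depends on whether a missing bar should count as "not a" for your purposes.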
amrrs