I would like to subset (filter) a dataframe by specifying which rows not (!) to keep in the new dataframe. Here is a simplified sample dataframe:

data
v1 v2 v3 v4
a  v  d  c
a  v  d  d
b  n  p  g
b  d  d  h    
c  k  d  c    
c  r  p  g
d  v  d  x
d  v  d  c
e  v  d  b
e  v  d  c

For example, if a row of column v1 has a "b", "d", or "e", I want to get rid of that row of observations, producing the following dataframe:

v1 v2 v3 v4
a  v  d  c
a  v  d  d
c  k  d  c    
c  r  p  g

I have been successful at subsetting based on one condition at a time. For example, here I remove rows where v1 contains a "b":

sub.data <- data[data[ , 1] != "b", ]

However, I have many, many such conditions, so doing it one at a time is not desirable. I have not been successful with the following:

sub.data <- data[data[ , 1] != c("b", "d", "e"), ]

or

sub.data <- subset(data, data[ , 1] != c("b", "d", "e"))

I've tried some other things as well, like !%in%, but that doesn't seem to exist. Any ideas?

Henrik
Jota

8 Answers

Try this

subset(data, !(v1 %in% c("b","d","e")))
chl
  • Nice and simple, thanks. I'm not sure which solution I like better, this one or the one provided by Andrie. They are both easy and effective. All three solutions work for me, and I have never used `which()` before. So, it was nice to be introduced to that function. – Jota Jun 05 '11 at 17:25
  • If it helps you to make up your mind as to whether to use `subset` or `[`, have a look at the warning in the help for `?subset`: *"This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences."* – Andrie Jun 06 '11 at 12:45
  • @Andrie Thanks for adding clarification. – chl Jun 06 '11 at 12:55

The ! should be around the outside of the statement:

data[!(data$v1 %in% c("b", "d", "e")), ]

  v1 v2 v3 v4
1  a  v  d  c
2  a  v  d  d
5  c  k  d  c
6  c  r  p  g
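
The question mentions looking for a `!%in%` operator. Base R does not provide one, but it can be built from `%in%`. The following is a minimal sketch on the question's sample data (two columns only); the name `%notin%` is a hypothetical choice for illustration, not a base R operator:

```r
# Sample data from the question (only the columns needed here).
data <- data.frame(
  v1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
  v4 = c("c", "d", "g", "h", "c", "g", "x", "c", "b", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the logical negation of %in%.
`%notin%` <- Negate(`%in%`)

# Keep only the rows whose v1 is not one of the listed values.
sub.data <- data[data$v1 %notin% c("b", "d", "e"), ]
sub.data$v1
```

This is equivalent to writing `!(data$v1 %in% c("b", "d", "e"))`; the helper only saves the parentheses.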
Andrie

You can also accomplish this by writing a separate logical test for each value and joining the tests with &.

subset(my.df, my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e")

This is not elegant and takes more code but might be more readable to newer R users. As pointed out in a comment above, subset is a "convenience" function that is best used when working interactively.

Jota
N Brouwer
  • shouldn't those be `|` rather than `&` ? – Ben Bolker Apr 09 '14 at 13:08
  • @BenBolker If you change to `|`, you get the same data as were put in. – Jota Jul 04 '14 at 15:10
  • @Frank Can you explain the logic of `&` paired with `!=` here? Like Ben, it seems like `|` should be used, but you're right that it shouldn't. I'm especially confused about subsetting multiple columns that way. For example, using Herman's sample data above, to remove all cases of "b" from v1 and all of "n" from v2, I would think that `my.df[my.df$v1 != "b" & my.df$v2 != "n",]` would only remove cases that met both of those criteria (i.e. only Row 3), rather than either of those criteria (i.e. both Row 3 and Row 4). In fact, using `|` with `!=` does what I expect `&` to do, but I don't get why. – coip Feb 12 '15 at 16:45
  • With `|` a single `TRUE` result among any of the conditions will cause the whole statement to evaluate to `TRUE`. All the conditions must evaluate to `FALSE` for the statement to evaluate to `FALSE`. With `&` a single `FALSE` condition will make the whole statement evaluate to `FALSE`. If you want to use or, you can use exclusive or: `xor` like so: `subset(my.df, xor(xor(my.df$v1 != "b", my.df$v1 != "d"), my.df$v1 != "e"))`. – Jota Feb 13 '15 at 00:11
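
The `&` versus `|` question in these comments comes down to De Morgan's law: "not (b or d or e)" is the same as "(not b) and (not d) and (not e)". A small sketch, assuming only the `v1` column of the question's data, may make this concrete:

```r
v1 <- c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e")

# "Drop b, d and e" means: keep rows where v1 is NOT (b OR d OR e).
not_any <- !(v1 == "b" | v1 == "d" | v1 == "e")

# By De Morgan's law this equals: (not b) AND (not d) AND (not e).
all_not <- v1 != "b" & v1 != "d" & v1 != "e"

# Joining the != tests with | is TRUE for every value, because no single
# value can equal "b", "d" and "e" at once, so no row would be dropped.
any_not <- v1 != "b" | v1 != "d" | v1 != "e"

identical(not_any, all_not)  # TRUE
all(any_not)                 # TRUE
```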
data <- data[-which(data[,1] %in% c("b","d","e")),]
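
One caveat with the `-which()` form: when no row matches, `which()` returns an empty vector, and negative indexing with an empty vector selects no rows at all. A minimal sketch of that edge case, using a reduced version of the data that contains none of the excluded values (an assumption made for the demonstration):

```r
data <- data.frame(v1 = c("a", "a", "c", "c"), stringsAsFactors = FALSE)

# No row matches, so which() returns integer(0) ...
idx <- which(data[, 1] %in% c("b", "d", "e"))

# ... and indexing with -integer(0) selects zero rows instead of all four.
n_which <- nrow(data[-idx, , drop = FALSE])

# The logical form keeps all four rows, as intended.
n_logical <- nrow(data[!(data[, 1] %in% c("b", "d", "e")), , drop = FALSE])

c(n_which, n_logical)
```

For that reason the logical `!(... %in% ...)` form is usually the safer choice inside functions or scripts.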
Dason
paul c

This answer is meant to explain why rather than how. The '==' operator in R is vectorized in the same way as the '+' operator: it compares the elements on the left side with the elements on the right side, element by element. For example:

> 1:3 == 1:3
[1] TRUE TRUE TRUE

Here the first test is 1==1, which is TRUE, the second is 2==2, and the third is 3==3. Notice that the following returns FALSE for the first and third elements because the order differs:

> 3:1 == 1:3
[1] FALSE  TRUE FALSE

Now, if one object is shorter than the other, the shorter object is repeated as many times as it takes to match the longer one. If the length of the longer object is not a multiple of the length of the shorter one, you get a warning that the shorter object could not be repeated a whole number of times. For example:

>  1:2 == 1:3
[1]  TRUE  TRUE FALSE
Warning message:
In 1:2 == 1:3 :
  longer object length is not a multiple of shorter object length

Here the first match is 1==1, then 2==2, and finally 1==3 (FALSE) because the left side is smaller. If one of the sides is only one element then that is repeated:

> 1:3 == 1
[1]  TRUE FALSE FALSE

The correct operator to test whether an element is in a vector is indeed '%in%', which is vectorized only over its left operand (for each element of the left vector, it tests whether that element matches any element of the right vector).

Alternatively, you can use '&' to combine two logical statements. '&' takes two elements and checks elementwise if both are TRUE:

> 1:3 == 1 & 1:3 != 2
[1]  TRUE FALSE FALSE
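
Applied to the question's data, the same recycling rule explains why `!= c("b", "d", "e")` does not work. A short sketch, assuming only the `v1` column from the question (the names `drop` and `recycled` are illustrative):

```r
v1 <- c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e")
drop <- c("b", "d", "e")

# != recycles `drop` along v1 (b, d, e, b, d, e, ...), so each row is
# compared with only one of the three values. R also warns here because
# 10 is not a multiple of 3; the warning is suppressed for this demo.
recycled <- suppressWarnings(v1 != drop)
v1[recycled]
# [1] "a" "a" "b" "c" "c" "d" "e"

# %in% checks every element of v1 against the whole set.
v1[!(v1 %in% drop)]
# [1] "a" "a" "c" "c"
```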
Sacha Epskamp
my.df <- read.table(textConnection("
v1 v2 v3 v4
a  v  d  c
a  v  d  d
b  n  p  g
b  d  d  h    
c  k  d  c    
c  r  p  g
d  v  d  x
d  v  d  c
e  v  d  b
e  v  d  c"), header = TRUE)

my.df[which(my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e" ), ]

  v1 v2 v3 v4
1  a  v  d  c
2  a  v  d  d
5  c  k  d  c
6  c  r  p  g
Roman Luštrik
sub.data<-data[ data[,1] != "b"  & data[,1] != "d" & data[,1] != "e" , ]

Longer, but simple to understand (I guess), and it can be extended to multiple columns or combined with other tests such as !is.na(data[,1]).
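
The `!is.na()` part matters when the column contains missing values, because `NA != "b"` is `NA` rather than `TRUE` or `FALSE`. A minimal sketch on a small made-up data frame (the values and the names `with_na`, `no_na`, and `kept` are assumptions for illustration only):

```r
data <- data.frame(
  v1 = c("a", NA, "b", "c"),
  v2 = c("v", "n", "d", "k"),
  stringsAsFactors = FALSE
)

# NA != "b" is NA, and an NA in a logical row index yields an all-NA row.
with_na <- data[data[, 1] != "b", ]

# Adding !is.na() makes the condition FALSE for the missing value,
# so that row is dropped instead.
no_na <- data[!is.na(data[, 1]) & data[, 1] != "b", ]

# %in% never returns NA, so the row with the missing v1 is kept unchanged.
kept <- data[!(data[, 1] %in% "b"), ]
```

Which of the three behaviours is wanted depends on whether rows with a missing `v1` should survive the filter.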

Toribio
Hernan

And also

library(dplyr)
data %>% filter(!v1 %in% c("b", "d", "e"))

or

data %>% filter(v1 != "b" & v1 != "d" & v1 != "e")

or

data %>% filter(v1 != "b", v1 != "d", v1 != "e")

The last form works because filter() combines comma-separated conditions with &.

Joe