8

I am subsetting a dataframe according to multiple criteria across several columns. I am choosing the rows in the dataframe that contain any one of several values defined in the vector "criteria" in any one of three different columns.

I have some code that works, but wonder what other (more elegant?) ways there are to do this. Here is what I've done:

criteria <-c(1:10)
subset1 <-subset(data, data[, "Col1"] %in% criteria | data[, "Col2"]
 %in% criteria | data[, "Col3"] %in% criteria)

Suggestions warmly welcomed. (I am an R beginner, so very simple explanations about what you are suggesting are also warmly welcomed.)

user1257313
  • 1,057
  • 4
  • 11
  • 10

2 Answers2

14

I'm not sure if you need two apply calls here:

# Data
df=data.frame(x=1:4,Col1=c(11,12,3,13),Col2=c(9,12,10,13),Col3=c(9,13,42,23))
criteria=1:10

# Solution
df[apply(df [c('Col1','Col2','Col3')],1,function(x) any(x %in% criteria)),]

Unless you want to do a lot of columns, then it is probably more readable to say:

subset(df, Col1 %in% criteria | Col2 %in% criteria | Col3 %in% criteria) 
nograpes
  • 18,623
  • 1
  • 44
  • 67
  • note that the man page for `subset` cautions against using it programatically, as it uses non standard evaluation – richiemorrisroe Mar 10 '12 at 10:54
  • 2
    I read that caution on the help page, but I don't know what it means. Why would "non-standard evaluation" be a problem? In my case, I am just sectioning up data to create some descriptive statistics, so I don't think "subset" will be cause issues for me... but in what cases would it cause issues? Thanks. – user1257313 Mar 10 '12 at 20:04
6

I'm using DF rather than data as the example.

DF[apply(apply(as.matrix(DF[c("Col1","Col2","Col3")]), 
               c(1,2), `%in%`, criteria), 
         1, any),]

For a breakdown of what this is doing:

Make a matrix of the specified columns, and for each element in that matrix test if it contains one of the criteria. Then for each row of that matrix, see if any of the row elements are TRUE. If so, keep the corresponding row of the original dataset.

Working through an example:

Start with dummy data:

DF <- data.frame(Col1=seq(1, by=2, length=10),
                 Col2=seq(3, by=3, length=10),
                 Col3=seq(7, by=1, length=10),
                 other=LETTERS[1:10])

which looks like

> DF
   Col1 Col2 Col3 other
1     1    3    7     A
2     3    6    8     B
3     5    9    9     C
4     7   12   10     D
5     9   15   11     E
6    11   18   12     F
7    13   21   13     G
8    15   24   14     H
9    17   27   15     I
10   19   30   16     J

Pull out just the columns of interest.

> as.matrix(DF[c("Col1","Col2","Col3")])
      Col1 Col2 Col3
 [1,]    1    3    7
 [2,]    3    6    8
 [3,]    5    9    9
 [4,]    7   12   10
 [5,]    9   15   11
 [6,]   11   18   12
 [7,]   13   21   13
 [8,]   15   24   14
 [9,]   17   27   15
[10,]   19   30   16

Check each entry versus the criteria

> apply(as.matrix(DF[c("Col1","Col2","Col3")]), c(1,2), `%in%`, criteria)
       Col1  Col2  Col3
 [1,]  TRUE  TRUE  TRUE
 [2,]  TRUE  TRUE  TRUE
 [3,]  TRUE  TRUE  TRUE
 [4,]  TRUE FALSE  TRUE
 [5,]  TRUE FALSE FALSE
 [6,] FALSE FALSE FALSE
 [7,] FALSE FALSE FALSE
 [8,] FALSE FALSE FALSE
 [9,] FALSE FALSE FALSE
[10,] FALSE FALSE FALSE

Test if any of the values in a row are TRUE

> apply(apply(as.matrix(DF[c("Col1","Col2","Col3")]), c(1,2), `%in%`, criteria), 1, any)
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

Use that to index the original data frame.

> DF[apply(apply(as.matrix(DF[c("Col1","Col2","Col3")]), c(1,2), `%in%`, criteria), 1, any),]
  Col1 Col2 Col3 other
1    1    3    7     A
2    3    6    8     B
3    5    9    9     C
4    7   12   10     D
5    9   15   11     E
Brian Diggs
  • 57,757
  • 13
  • 166
  • 188