
My data.frame (df) has 20 columns.
All columns hold integer values (range 0 - 99).

Let's say I would like to keep only the rows where both col1 and col2 have values lower than 4.
My code could be:

df2 <- subset(df, col1 < 4 & col2 < 4)

That's fine.

But how can I modify my code to apply the same condition to all 20 columns, without specifying every column individually?

Thanks for your help!

ChrisM
  • It would be easier to help you if you provided a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data that can be used to test and verify possible solutions. – MrFlick Aug 28 '17 at 18:09
  • @MrFlick correct -> I'm looking for an efficient way to code cases like that from my example –  Aug 28 '17 at 18:16
  • @G5W when the data contain `NA`, that will cause a problem. – BENY Aug 28 '17 at 18:23
  • @MrFlick You are right. I was mis-reading the question. – G5W Aug 28 '17 at 18:32

2 Answers

df2 <- df[apply(df, 1, max) < 4,]
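A minimal sketch of how this line behaves, using a hypothetical three-column `df` that is not from the question. `apply(df, 1, max)` computes each row's maximum, so a row is kept only when all of its columns are below 4. The `rowSums` variant and the `na.rm = TRUE` handling are illustrative assumptions; the latter is one possible response to the `NA` concern raised in the comments.

```r
# Hypothetical sample data; the question's real df has 20 integer columns.
df <- data.frame(col1 = c(1L, 2L, 5L, 0L),
                 col2 = c(3L, 7L, 1L, 2L),
                 col3 = c(0L, 1L, 2L, 3L))

# Row-wise maximum: a row passes only if its largest value is below 4.
row_max <- apply(df, 1, max)
df2 <- df[row_max < 4, ]

# Same test without apply: count, per row, the values that fail the condition.
df3 <- df[rowSums(df >= 4) == 0, ]

# With an NA in a row, max() returns NA and the comparison is NA as well.
# One option (an assumption about the desired behaviour) is to ignore NAs.
df_na <- df
df_na[1, 2] <- NA
df4 <- df_na[apply(df_na, 1, max, na.rm = TRUE) < 4, ]
```

With this sample, `df2`, `df3`, and `df4` all keep rows 1 and 4.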
dvantwisk

Here is a method that is faster than apply. It uses max.col, matrix subsetting, and logical subsetting. First, construct a sample dataset.

set.seed(1234)
dat <- data.frame(a=sample(1:3, 5, replace=TRUE),
                  b=sample(1:4, 5, replace=TRUE),
                  c=sample(1:6, 5, replace=TRUE))

It looks like this.

dat
  a b c
1 1 3 5
2 2 1 4
3 2 1 2
4 2 3 6
5 3 3 2

Notice that only the third column has values greater than 4, and that only 2 elements of that column (in rows 1 and 4) pass the test. Now, we do

dat[dat[cbind(seq_along(dat[[1]]), max.col(dat))] > 4, ]
  a b c
1 1 3 5
4 2 3 6

Here, max.col(dat) returns, for each row, the index of the column that holds the maximum value. seq_along(dat[[1]]) runs through the row numbers. cbind combines the two into a two-column matrix of (row, column) positions, which we use to pull out the maximum value of each row via matrix subsetting. We then compare these values with > 4 to check whether any value in the row is greater than 4. This returns a logical vector whose length is the number of rows of the data.frame, and it is used to subset the data.frame by row.
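The steps can also be written out one at a time. This is only a sketch: the intermediate names (`rows`, `cols`, `idx`, `row_max`) are made up for illustration, and the data values are typed in directly from the printed table so that the result does not depend on the random number generator. `ties.method = "first"` is chosen here for reproducibility; the default `"random"` only changes which tied column is picked, not the maximum value itself.

```r
# Same values as the printed sample data, entered by hand.
dat <- data.frame(a = c(1, 2, 2, 2, 3),
                  b = c(3, 1, 1, 3, 3),
                  c = c(5, 4, 2, 6, 2))

rows    <- seq_along(dat[[1]])                 # 1, 2, ..., nrow(dat)
cols    <- max.col(dat, ties.method = "first") # column position of each row's maximum
idx     <- cbind(rows, cols)                   # two-column (row, column) index matrix
row_max <- dat[idx]                            # one maximum value per row

above <- dat[row_max > 4, ]  # rows with at least one value above 4
below <- dat[row_max < 4, ]  # rows where every value is below 4
```

For the original question, which asks for rows where all columns are below 4, the `row_max < 4` form is the one that applies.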

lmo
  • great alternative! thank you very much for your answer and help! :) –  Aug 29 '17 at 07:43
  • @dvantwisk I upvoted your answer because it looks cleaner, but `apply` is notoriously slow, especially for data.frames. It is essentially a fancy wrapper for a `for` loop that has to convert the data.frame to a matrix prior to running that loop. `max.col`, while still requiring matrix conversion, is optimized under the hood, and matrix indexing (subsetting) is usually incredibly fast. – lmo Aug 29 '17 at 14:08
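The speed claim can be checked on one's own machine with a rough timing sketch like the one below. The data set `big` and its size are hypothetical, and `system.time` gives only a coarse measurement, so the numbers will vary between machines and R versions.

```r
set.seed(1)
# Hypothetical larger data set: 100,000 rows, 20 integer columns in 0-99.
big <- as.data.frame(matrix(sample(0:99, 2e6, replace = TRUE), ncol = 20))

# Row maxima via apply.
t_apply <- system.time(m1 <- apply(big, 1, max))

# Row maxima via max.col and matrix subsetting.
t_maxcol <- system.time(
  m2 <- big[cbind(seq_len(nrow(big)), max.col(big, ties.method = "first"))]
)

# Both approaches should produce the same row maxima.
same <- isTRUE(all.equal(unname(m1), m2))

t_apply["elapsed"]
t_maxcol["elapsed"]
```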