Conditionally subset data frame in R

Question

I have a data frame that has 10 columns and 510 rows. I'm trying to create a subset of it wherein if the row sum of the first 5 columns equals 0, the entire row is discarded. I've read posts on this site saying that you can't simply delete rows in R, so I've tried the following:

    data_sub <- data[!sum(data[, 1:5]==0), ]

However, data_sub ends up being a copy of data... and I'm really not sure why... Please advise! This data frame has no Inf or NaN values, only integers.

Please fix your syntax and provide a reproducible example- something that we can cut and paste and experiment with. — Michael Tuchman, Oct 25 '19 at 19:10
It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Oct 25 '19 at 19:10
Why is it necessary to delete? Why not just work with the subset? — Michael Tuchman, Oct 25 '19 at 19:11
Check the grouping of your parentheses. The answers below show some alternate forms that read better. What yours does is not exactly what you say it does. Yours checks the first 5 columns, makes sure ALL columns are non-zero, sums those boolean columns. I don't think that's what you were thinking. — Michael Tuchman, Oct 25 '19 at 19:39
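
To make the parenthesis issue concrete, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the asker's data). It compares the single value produced by `sum()` with the per-row values produced by `rowSums()`:

```r
# Hypothetical toy data: rows 2 and 4 sum to 0 over columns 1 to 5.
toy <- data.frame(
  a = c(1, 0, 2, 0),
  b = c(0, 0, 1, 0),
  c = c(3, 0, 0, 0),
  d = c(0, 0, 4, 0),
  e = c(1, 0, 0, 0),
  f = c(9, 8, 7, 6)
)

# sum() collapses the whole logical matrix into ONE number:
# the count of zero cells in the first five columns.
n_zero <- sum(toy[, 1:5] == 0)

# Negating that number gives a single TRUE/FALSE,
# which R recycles across every row when used as a row index.
scalar_index <- !n_zero

# rowSums() returns one value per row, so the comparison
# produces one TRUE/FALSE per row, which is what the filter needs.
row_index <- rowSums(toy[, 1:5]) != 0

toy[scalar_index, ]  # all rows or no rows, never a real filter
toy[row_index, ]     # only rows whose first five columns do not sum to 0
```

If the first five columns happened to contain no exact zeros, the scalar index would be a single `TRUE` and every row would be kept, which would be consistent with the copy described in the question.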

slava-kohut · Accepted Answer · 2019-10-25T19:17:28.930

Try the following:

    ind <- apply(data, 1, function(x) sum(x[1:5]) != 0)
    data_sub <- data[ind, ]

or

    data_sub <- data[rowSums(data[, 1:5]) != 0, ]

edited Oct 25 '19 at 19:17

answered Oct 25 '19 at 19:12

@EmilyLauren No problem. Consider upvoting or accepting if you like the answer. – slava-kohut Oct 25 '19 at 19:21
slava's answer is most succinct, but we can also be defensive adding an na.rm = TRUE to the rowSums function: df[rowSums(df[,1:5], na.rm = TRUE) != 0, ] – hello_friend Oct 26 '19 at 07:44
@hello_friend agreed – slava-kohut Oct 26 '19 at 12:59
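
A small sketch of what the `na.rm = TRUE` suggestion changes, assuming a hypothetical data frame `df` with one missing value (the question states the real data has none, so this is purely illustrative):

```r
# Hypothetical data with an NA in the first five columns (row 3).
df <- data.frame(
  a = c(1, 0, NA),
  b = c(0, 0, 0),
  c = c(2, 0, 0),
  d = c(0, 0, 0),
  e = c(0, 0, 0),
  f = c(5, 6, 7)
)

# Without na.rm, row 3 has an NA row sum, so its index is NA
# and the subset picks up a row filled entirely with NA.
strict <- df[rowSums(df[, 1:5]) != 0, ]

# With na.rm = TRUE, NA cells are skipped, row 3 sums to 0,
# and it is dropped like any other all-zero row.
defensive <- df[rowSums(df[, 1:5], na.rm = TRUE) != 0, ]
```

Whether a row whose only non-zero entries are missing should be dropped is a judgment call about the data; `na.rm = TRUE` treats the missing cells as contributing nothing to the sum.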

Michael Tuchman · Answer 2 · 2019-10-25T19:35:13.253

This is what you want:

    reprex[rowSums(reprex[,1:5])!=0,]

It returns a data set meeting your criteria. This applies to arrays or data frames. Notice, however, that the original HAS NOT CHANGED, nor should it.

In the future, consider including a reproducible example like the one in the code below. It doesn't have to be complex, but I think you'll find the act of making one will clarify your thinking. It does for me!

    # emily example

    # sample_column: each entry has a 50% chance of being zero
    # and a 50% chance of being a random number
    set.seed(152)
    sample_column<-function(col_length) {
      ifelse(runif(col_length)<0.5,0,runif(col_length))
    }

    # produce some columns of random numbers.  Spike it with
    # zeroes to make the filter actually catch some.

    make_reprex<-function(nrows,ncols) {
      id=1:nrows
      colnames=paste0('x',1:ncols)
      data=matrix(nrow=nrows,ncol=ncols)
      rownames(data)=id
      colnames(data)=colnames
      for (j in 1:ncols) {
        data[,j]=sample_column(nrows)
      }
      return(data)
    }

    reprex=make_reprex(510,15)
    # desired expression: keep rows whose first 5 columns do not sum to zero
    reprex[rowSums(reprex[,1:5])!=0,]

If you wish to subset the data as though in place, you'll need to make another assignment.

    reprex=reprex[rowSums(reprex[,1:5])!=0,]

I advise against this kind of in-place substitution. There are some cases where it is necessary, but rarely as often as you might think.

The reason? If you avoid destructive subsetting and something goes wrong, you can easily return to the data frame as you originally loaded it.
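
As a rough illustration of the non-destructive approach, the sketch below stores the filtered rows under a new name. The names `raw` and `filtered` and the tiny matrix are hypothetical stand-ins for the loaded data:

```r
# Hypothetical small matrix standing in for the data as loaded.
raw <- matrix(c(0, 0,
                1, 2,
                0, 0), nrow = 3, byrow = TRUE)

# Keep the result under a new name; `raw` itself is left untouched.
# drop = FALSE keeps the result a matrix even when only one row survives.
filtered <- raw[rowSums(raw) != 0, , drop = FALSE]

nrow(raw)       # original row count is unchanged
nrow(filtered)  # only rows with a non-zero sum remain
```

If the filter turns out to be wrong, `raw` is still available and the subset can simply be recomputed.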
