0

I'm using R and I am trying to delete some rows from a data frame based on some constrains. so, if I got

dat <- data.frame(Cs=c("c1","c2","c3","c4","c5","c6"),  
  R1=sample(c("Y","N"),6,replace=TRUE), R2=sample(c("Y","N"),6,replace=TRUE),
  R3=sample(c("Y","N"),6,replace=TRUE), R4=sample(c("Y","N"),6,replace=TRUE),
  R5=sample(c("Y","N"),6,replace=TRUE), R6=sample(c("Y","N"),6,replace=TRUE))

I'd like to delete all the rows having a "N" at some given columns such as R1, R3, R4. For one single column, I found this solution: delete row for certain constrains

d <- dat[dat[,"R1"]!="N",]

which works fine. but if I put multiple columns as

d <- dat[dat[,c("R1","R3","R4")]!="N",]

I got lots of extra rows full of NA. So where am I wrong?

Community
  • 1
  • 1
Stefano
  • 361
  • 1
  • 4
  • 21

2 Answers2

1

You can use

dat[rowSums(dat[, c("R1","R3","R4")] == "N") == 0, , drop=FALSE]
#  Cs R1 R2 R3 R4 R5 R6
#5 c5  Y  Y  Y  Y  Y  Y

Or, if you don't like excessive typing:

dat[!rowSums(dat[c('R1','R3','R4')]=='N'),]

This will first test each "cell" of columns "R1", "R3" and "R4" of your data whether it is equal to "N" and then calculate the sums of TRUE values per row. If no "N" is present in a row, the sum is equal to 0 and will be kept. I added drop=FALSE to keep the structure as a data.frame.

Note after a comment by OP:

If you subset only 1 column of a data.frame without specifying a drop=TRUE option, the default behavior of [.data.frame is to coerce the resulting 1-column-data.frame to an atomic vector. Then, rowSums wouldn't work on that resulting vector. To avoid that, change your code to:

dat[!rowSums(dat[,'R1', drop=FALSE]=='N'), ] 

Sample data:

set.seed(5) 
dat <- data.frame(Cs=c("c1","c2","c3","c4","c5","c6"),  
                  R1=sample(c("Y","N"),6,replace=TRUE), R2=sample(c("Y","N"),6,replace=TRUE),
                  R3=sample(c("Y","N"),6,replace=TRUE), R4=sample(c("Y","N"),6,replace=TRUE),
                  R5=sample(c("Y","N"),6,replace=TRUE), R6=sample(c("Y","N"),6,replace=TRUE))
talat
  • 68,970
  • 21
  • 126
  • 157
  • How would you incorporate testing only specific columns? Could you do `dat[rowSums(dat[,c("R1","R3","R4")]== "N") == 0, , drop=FALSE]`? – Heroka Nov 27 '15 at 16:42
  • Thanks a lot! short and clean solution and it works. just one remark. you say "if the sum is equal to 0, the row will be kept" but I guess it's the opposite, isn't it? – Stefano Nov 27 '15 at 16:54
  • @Stefano, no, it is as I said. If the number of "N"s in a row (and the specified columns) is 0, the row is kept. You can look at the row sums of my example: `rowSums(dat[c('R1','R3', 'R4')]=='N') #[1] 1 3 2 1 0 2`. You see that in the fifth row, there are no "N"s and that is the row that is kept in the example – talat Nov 27 '15 at 16:57
  • @docendodiscimus yes sorry. my bad! I got "if N is present" and not "if no N"... thanks again! – Stefano Nov 27 '15 at 17:00
  • You could also add the `dat[!Reduce('+',lapply(dat[c('R1','R3', 'R4')],'==', 'N')),]` – akrun Nov 27 '15 at 17:20
  • Yes! actually it might work even better as by using the other it crashes with a single column, while this not. thanks! – Stefano Nov 27 '15 at 18:13
  • @Stefano, just change the code to `dat[!rowSums(dat[,'R1', drop=FALSE]=='N'),]`. `drop=FALSE` is required if you subset only 1 column of a `data.frame` and you want to keep it as a `data.frame` (instead of converting it to an atomic vector). – talat Nov 27 '15 at 18:25
0

You could make a 'keep'-variable consisting of booleans for each row:

keep <- apply(dat[,c("R1","R3","R4")],
                  MARGIN=1,
                  FUN=function(x){all(x!='N')})
res <- dat[keep,]

> res
  Cs R1 R2 R3 R4 R5 R6
1 c1  Y  Y  Y  Y  Y  Y

data: seed used: 1234

dat <- structure(list(Cs = structure(1:6, .Label = c("c1", "c2", "c3", 
"c4", "c5", "c6"), class = "factor"), R1 = structure(c(2L, 1L, 
1L, 1L, 1L, 1L), .Label = c("N", "Y"), class = "factor"), R2 = structure(c(2L, 
2L, 1L, 1L, 1L, 1L), .Label = c("N", "Y"), class = "factor"), 
    R3 = structure(c(2L, 1L, 2L, 1L, 2L, 2L), .Label = c("N", 
    "Y"), class = "factor"), R4 = structure(c(1L, 1L, 1L, 1L, 
    1L, 1L), .Label = "Y", class = "factor"), R5 = structure(c(2L, 
    1L, 1L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), 
    R6 = structure(c(2L, 2L, 2L, 1L, 2L, 1L), .Label = c("N", 
    "Y"), class = "factor")), .Names = c("Cs", "R1", "R2", "R3", 
"R4", "R5", "R6"), row.names = c(NA, -6L), class = "data.frame")
Heroka
  • 12,889
  • 1
  • 28
  • 38