13

I am trying to select rows from a data frame. Why does the last query below return all 5 records and not just the first two?

> x <- c(5,1,3,2,4)
> y <- c(1,5,3,4,2)
> data <- data.frame(x,y)
> data
  x y
1 5 1
2 1 5
3 3 3
4 2 4
5 4 2
> data[data$x > 4 || data$y > 4]
  x y
1 5 1
2 1 5
3 3 3
4 2 4
5 4 2
fatdragon

3 Answers

23

(1) For selecting data (subsetting), I highly recommend the subset function. It is part of base R (not plyr, as is sometimes assumed), so no package needs to be loaded, and it is clean and easy to use:

subset(data, x > 4 | y > 4)

UPDATE:

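There is a successor to plyr called dplyr, written by Hadley Wickham, which is reportedly much faster and easier to use. If you have ever seen operators like %.% or %>%, they chain operations together: %>% comes from the magrittr package and is re-exported by dplyr, while %.% was an early dplyr operator that was later deprecated in favor of %>%.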

library(dplyr)
result <- data %>%
          filter(x > 4 | y > 4)  # NOTE: filter(condition1, condition2, ...) combines the conditions with AND.

(2) There indeed exist some differences between | and ||:

You can look at the help manual by doing this: ?'|'

The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector. Evaluation proceeds only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in if clauses.

> c(1,1,0) | c(0,0,0)
[1]  TRUE  TRUE FALSE
> c(1,1,0) || c(0,0,0)
[1] TRUE

Per your question, || looked only at the first element of each vector (5 > 4 is TRUE), so what you did is effectively data[TRUE], which returns the complete data frame.
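
The difference can be checked step by step. The following is a minimal sketch using the data from the question; the names `keep` and `first_only` are made up for illustration. It indexes the first elements explicitly, because recent R versions (4.3.0 and later) raise an error when || is given a vector longer than one instead of silently using the first element.

```r
x <- c(5, 1, 3, 2, 4)
y <- c(1, 5, 3, 4, 2)
data <- data.frame(x, y)

# Element-wise OR: one logical value per row.
keep <- data$x > 4 | data$y > 4
print(keep)

# What the original `||` condition effectively evaluated: the first elements only.
first_only <- data$x[1] > 4 || data$y[1] > 4
print(first_only)

# A single TRUE index is recycled over the columns, so nothing is filtered out.
print(dim(data[first_only]))
```

Only `keep` carries per-row information, which is why a row filter needs | rather than ||.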

Kim
B.Mr.W.
  • Are you sure `subset` is from `plyr`? – David Arenburg Mar 17 '16 at 12:27
  • @DavidArenburg, I think the name space of R is confusing sometimes and I am pretty confident there is a subset function in dplyr, probably plyr, and there might also be a subset function from the base package.. – B.Mr.W. Mar 17 '16 at 19:52
  • There is no `subset` function in either plyr or dplyr. There is a method for `data.table`, though, if that is what was confusing you. – David Arenburg Mar 17 '16 at 19:54
  • @DavidArenburg, in addition to what you say: `subset` is recommended for interactive use only; the `[` and `[[` subsetters are recommended otherwise. `subset` may also be unsuitable here, since it cannot handle `NA`s in a data frame properly when filtering. – Erdogan CEVHER Jul 31 '18 at 06:21
  • EDIT: Although `subset` is recommended for interactive use only, with the `[` and `[[` subsetters preferred otherwise, the above solution IS robust to the existence of `NA`s in the data frame: `vc <- data.frame(duzey=factor(c("Y","O","Y","D","Y","Y","O"), levels=c("D","O","Y"), ordered=TRUE), cinsiyet=c("E","E","K",NA,"K","E","K"), yas=c(8,3,9,NA,7,NA,6), Not=c(NA,1,1,NA,NA,2,1)); vc; subset(vc, cinsiyet=="E" | Not<4); subset(vc, cinsiyet=="E" & Not<2)`. Hence, since it can handle data frames with `NA`s, this solution is good. – Erdogan CEVHER Jul 31 '18 at 08:18
  • Why does | seem to work like "and" inside filter()? For example, "filter if both conditions are true," shouldn't that use "condition 1 & condition 2"? Yet it treats that as "if condition 1 is true and if condition 2 is true" but I need both to be true, not just one. – aegon Dec 22 '20 at 21:10
5

Here's something that works for me.

data[data[,1] > 4 | data[,2] > 4,1:2]

I'm not sure exactly why your method isn't working, but I think it is because the index never tells R which rows to leave out. Look at help("[").

CCurtis
  • This solution - which is isomorphically equivalent to 3-voted `data[data$x > 4 | data$y > 4,]` solution - is NOT robust to the existence of `NA`s in the dataframe: `vc <- data.frame(duzey=factor(c("Y","O","Y","D","Y","Y","O"), levels=c("D","O","Y"), ordered=TRUE), cinsiyet=c("E","E","K",NA,"K","E","K"), yas=c(8,3,9,NA,7,NA,6), Not=c(NA,1,1,NA,NA,2,1)); vc; vc[vc[,2] == "E" | vc[,4] < 4,1:4]; vc[vc[,2] == "E" & vc[,4] < 2,1:4]`. – Erdogan CEVHER Jul 31 '18 at 08:28
4

Taking your exact code and modifying it slightly

> x <- c(5,1,3,2,4)
> y <- c(1,5,3,4,2)
> data <- data.frame(x,y)
> data[data$x > 4 | data$y > 4,]
  x y
1 5 1
2 1 5

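There are two important things to note. The first is that || has been changed to |. The second is that there is an additional comma (,) just before the closing square bracket: the condition before the comma selects rows, and leaving the slot after the comma empty keeps all columns. Together these allow the filter to work properly.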
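
To see what the comma changes, here is a small sketch on the same data. The variable names `keep`, `rows`, and `cols` are only illustrative. With a comma, the index in front of it picks rows; without a comma, a data frame is indexed like a list, so the index picks columns.

```r
x <- c(5, 1, 3, 2, 4)
y <- c(1, 5, 3, 4, 2)
data <- data.frame(x, y)

keep <- data$x > 4 | data$y > 4

# With the comma: the logical vector selects rows; all columns are kept.
rows <- data[keep, ]
print(rows)

# Without the comma: the index selects columns instead of rows.
cols <- data[c(TRUE, FALSE)]
print(names(cols))
print(nrow(cols))
```

In the second case every row survives, because only the set of columns was narrowed.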

hdost
  • This solution is NOT robust to the existence of `NA`s in the dataframe: `vc <- data.frame(duzey=factor(c("Y","O","Y","D","Y","Y","O"), levels=c("D","O","Y"), ordered=TRUE), cinsiyet=c("E","E","K",NA,"K","E","K"), yas=c(8,3,9,NA,7,NA,6), Not=c(NA,1,1,NA,NA,2,1)); vc; vc[vc$cinsiyet == "E" | vc$Not < 4,]; vc[vc$cinsiyet == "E" & vc$Not < 2,]` – Erdogan CEVHER Jul 31 '18 at 08:10
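
The `NA` behaviour raised in the comments can be reproduced with a smaller example. This is a sketch on a hypothetical data frame `df` (not the `vc` data from the comments): a logical row index containing `NA` produces all-`NA` rows with `[`, while `subset` drops them. Wrapping the condition in `which()` is one common way to get the same result with `[`.

```r
# Hypothetical data with one missing value in each column.
df <- data.frame(x = c(5, NA, 3, 1), y = c(1, 2, NA, 5))

# The condition is NA wherever a missing value prevents a definite answer.
cond <- df$x > 4 | df$y > 4
print(cond)

# Plain `[`: each NA in the index yields a row filled with NA.
with_na_rows <- df[cond, ]
print(with_na_rows)

# which() keeps only the positions that are definitely TRUE.
clean_bracket <- df[which(cond), ]
print(clean_bracket)

# subset() treats NA in the condition as FALSE.
clean_subset <- subset(df, x > 4 | y > 4)
print(clean_subset)
```

Whether the `NA` rows should be dropped or kept depends on the analysis, so the choice between the two forms is a design decision rather than a matter of one being correct.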