21

Let's say I have the data frame Mydata as shown below:

Mydata <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                     y = c(20, 30, 45, 54, 65, 78, 97, 102, 123, 156))

I want to filter this data frame and create another data frame, so that only the values of x between 3 and 7 and their corresponding y values are shown. I attempted the following:

new_frame <- Mydata %>% filter(x == (3:7))

This didn't work. How then would I filter for a specified range?

Thanks in advance for all help

tk3
  • 990
  • 1
  • 13
  • 18
Paul Ibrahim
  • 413
  • 1
  • 3
  • 14
  • 7
    Since you're using `dplyr`, you can also use `dplyr::between`. `between(x, 3, 7)` is a shortcut for `x >= 3 & x <= 7`. Not needed if you have integers, but if you had decimal numbers `%in%` wouldn't work. – Gregor Thomas Jun 29 '18 at 19:46

6 Answers6

35

You can use %in%, or as has been mentioned, alternatively dplyrs between():

 library(dplyr)
 
 new_frame <- Mydata %>% filter(x %in% (3:7) )
 new_frame
 #   x  y
 # 1 3 45
 # 2 4 54
 # 3 5 65
 # 4 6 78
 # 5 7 97

While %in% works great for integers (or other equally spaced sequences), if you need to filter on floats, or any value between and including your two end points, or just want an alternative that's a bit more explicit than %in%, use dplyr's between():

 new_frame2 <- Mydata%>% filter( between(x, 3, 7) )
 new_frame2
 #   x  y
 # 1 3 45
 # 2 4 54
 # 3 5 65
 # 4 6 78
 # 5 7 97     

To further clarify, note that %in% checks for the presence in a set of values:

3 %in% 3:7
# [1] TRUE
5 %in% 3:7
# [1] TRUE
5.0 %in% 3:7
# [1] TRUE

The above return TRUE because 3:7 is shorthand for seq(3, 7) which produces:

3:7
# [1] 3 4 5 6 7
seq(3, 7)
# [1] 3 4 5 6 7

As such, if you were to use %in% to check for values not produced by :, it will return FALSE:

4.5 %in% 3:7
# [1] FALSE
4.15 %in% 3:7
# [1] FALSE

Whereas between checks against the end points and all values in between:

between(3, 3, 7)
# [1] TRUE
between(7, 3, 7)
# [1] TRUE
between(5, 3, 7)
# [1] TRUE
between(5.0, 3, 7)
# [1] TRUE
between(4.5, 3, 7)
# [1] TRUE
between(4.15, 3, 7)
# [1] TRUE
MHammer
  • 1,274
  • 7
  • 12
  • You need a library call; without it, the typical R user is confronted with errors from stats::filter ... Either that, or someone should add the call to the OP (I was just judging by tags initially) – Frank Jun 29 '18 at 19:55
  • A word of caution: filter(x %in% (3:7) and filter( between(x, 3, 7) ) are only analogous if x contains whole numbers--filter(x %in% (3:7) won't return any numbers with decimals. – Hannah O. Jun 27 '22 at 09:08
  • 1
    @HannahO. Correct, which I already stated in my answer: `While %in% works great for integers, if you need to filter on floats, or ... use dplyr's between():`. Regardless, I have updated my answer to further clarify and address your comment. – MHammer Jun 28 '22 at 14:11
8

Plenty of good dplyr solutions such as filtering in or hard-coding the upper and lower bounds already present in some of the answers:

MydataTable%>% filter(between(x, 3, 70))
Mydata %>% filter(x %in% 3:7)
Mydata %>% filter(x>=3&x<=7)

You could also work with data.table, which is very fast for large data sets. inrange and between work identically for this purpose

library(data.table)
MydataTable <- data.table(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                          y = c(20, 30, 45, 54, 65, 78, 97, 102, 123, 156))
MydataTable[x %inrange% c(3,7)]
MydataTable[x %between% c(3,7)]

A benefit of this method (besides the speed of data.table) is that you only need to specify the min and max range - you are not creating an array to subset the filter.

A time comparison of these methods:

> df <- data.frame(x = sample(1:10, 10000000, replace = T),
+                      y = sample(1:10, 10000000, replace = T))
> system.time({ df %>% filter(between(x, 3, 7)) })
   user  system elapsed 
   0.18    0.05    0.14 
> system.time({ df %>% filter(x %in% 3:7) })
       user  system elapsed 
       0.19    0.06    0.29 
> system.time({ df %>% filter(x>=3&x<=7)  })
   user  system elapsed 
   0.17    0.09    0.26 

> dt <- data.table(df)
> system.time( {dt[x %inrange% c(3,7)] })
   user  system elapsed 
   0.13    0.07    0.21 
> system.time( {dt[x %between% c(3,7)] })
   user  system elapsed 
   0.18    0.05    0.13
cacti5
  • 2,006
  • 2
  • 25
  • 33
  • 3
    `dplyr::between` also works, `between(x, 3, 7)`. Since you're mentioning different methods, it would be useful to distinguish between them. `%in%`, `%between%`, and `%inrange%` all have different uses, they just happen to overlap in this case. And `{dt[x %in% 3:7]}` should also be compared in the benchmarking. (And single-run benchmarks can be quite variable, I'd recommend `microbenchmark::microbenchmark()` for timing purposes.) – Gregor Thomas Jun 29 '18 at 19:46
5

And the good old base::subset:

subset(Mydata, x >= 3 & x <= 7)
subset(Mydata, x %in% 3:7)
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
2

Piggybacking on @Anna 's answer, I just ran a few of the options to see which was faster on a larger dataset for a problem have at work. I used the setup from here (Faster way to subset on rows of a data frame in R?) I checked it on a 1 billion row (16gb) dataset. Looks like data.table edged out dplyr by a little bit. I am just starting to use data.table though so I may have not used the most efficient code. Oh, also, I narrowed it down to these 4 based on times from a 100million row dataset. See below:

set.seed(42)  
# 1 billion rows
df <- data.frame(age=sample(1:65,1e9,replace=TRUE),x=rnorm(1e9),y=rpois(1e9,25))



microbenchmark(df1 <- df %>% filter(age >= 5 & age <= 25),
               df2 <- df %>% filter(dplyr::between(df$age, 5, 25)),
               times=10)


Unit: seconds
                                      expr      min     lq   mean median     uq    max  neval
df %>% filter(age >= 5 & age <= 25)          15.327 15.796 16.526 16.601 17.086 17.996    10
df %>% filter(dplyr::between(df$age, 5, 25)) 14.214 14.752 15.413 15.487 16.121 16.447    10


DT <- as.data.table(df)
microbenchmark(dt1 <- DT[age %inrange% c(5, 25)],
               dt2 <- DT[age %between% c(5, 25)],
               times = 10)


Unit: seconds
                              expr    min     lq   mean median     uq    max neval
 dt1 <- DT[age %inrange% c(5, 25)] 15.122 16.042 17.180 16.969 17.310 22.138    10
 dt2 <- DT[age %between% c(5, 25)] 10.212 11.121 11.675 11.436 12.132 13.913    10
Andrew
  • 5,028
  • 2
  • 11
  • 21
1

Base R solution:

df <- Mydata[Mydata$x >= 3 & Mydata$x <= 7, ]

df
  x  y
3 3 45
4 4 54
5 5 65
6 6 78
7 7 97
sm925
  • 2,648
  • 1
  • 16
  • 28
1

The above answer is probably more user friendly but here are a couple more...

Mydata[Mydata$x >= 3 & Mydata$x <= 7, ]

  x  y
3 3 45
4 4 54
5 5 65
6 6 78
7 7 97

Which can be extended to return other columns, e.g., if you just wanted y:

Mydata[Mydata$x >= 3 & Mydata$x <= 7, 'y']

[1] 45 54 65 78 97

It can also return more than one column,e.g.:

Mydata <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                     y = c(20, 30, 45, 54, 65, 78, 97, 102, 123, 156),
                     z = c(5, 4, 3, 2, 1, 0, -1, -2, -3, -4))   

Mydata[Mydata$x >= 3 & Mydata$x <= 7, c('y','z')]

   y  z
3 45  3
4 54  2
5 65  1
6 78  0
7 97 -1
miken32
  • 42,008
  • 16
  • 111
  • 154
antimuon
  • 252
  • 2
  • 12
  • This only indexes the columns - It assumes the dataframe always has x values between 3 and 7 at indexes 3 to 7. Redefine your dataframe to `Mydata <- data.frame(x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10), y = c(0, 20, 30, 45, 54, 65, 78, 97, 102, 123, 156))` and it does not work – cacti5 Jun 29 '18 at 19:29
  • Why would you edit and still leave the index answer at the top? It's somewhere between misleading and flat-out incorrect. In the question text, OP is clearly states "the values of x" as the criteria, not "row indices". You should completely remove the index part as it does not address the question. (Or just delete, samadhi's answer already covers your edited method.) And `Mydata[Mydata$x[3:7], ]` is even more convoluted - hard to imagine a case where this would be wanted. – Gregor Thomas Jun 29 '18 at 19:54