443

When I need to filter a data.frame, i.e., extract rows that meet certain conditions, I prefer to use the subset function:

subset(airquality, Month == 8 & Temp > 90)

Rather than the [ function:

airquality[airquality$Month == 8 & airquality$Temp > 90, ]

There are two main reasons for my preference:

  1. I find the code reads better, from left to right. Even people who know nothing about R could tell what the subset statement above is doing.

  2. Because columns can be referred to as variables in the subset expression, I can save a few keystrokes. In my example above, I only had to type airquality once with subset, but three times with [.
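As a quick sanity check, the two calls above can be compared directly. This is only a sketch: the names `by_subset` and `by_bracket` are made up for illustration, and the comparison relies on `Month` and `Temp` containing no missing values in `airquality`.

```r
# Same filter written both ways, on the built-in airquality data set.
by_subset  <- subset(airquality, Month == 8 & Temp > 90)
by_bracket <- airquality[airquality$Month == 8 & airquality$Temp > 90, ]

# With no NA in Month or Temp, both forms should select the same rows.
all.equal(by_subset, by_bracket)
```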

So I was living happily, using subset everywhere because it is shorter and reads better, even advocating its beauty to my fellow R coders. But yesterday my world fell apart. While reading the subset documentation, I noticed this section:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

Could someone help clarify what the authors mean?

First, what do they mean by "for use interactively"? I know what an interactive session is, as opposed to a script run in BATCH mode, but I don't see what difference it should make.

Then, could you please explain "the non-standard evaluation of argument subset" and why it is dangerous, maybe provide an example?

flodel
  • 15
    It is slightly less typing (but not less than subset) to use with: `with(airquality, airquality[Month == 8 & Temp > 90, ])` – Tyler Rinker Mar 25 '12 at 13:09
  • 1
    This thread discusses the `subset()` warning: http://r.789695.n4.nabble.com/Variable-passed-to-function-not-used-in-function-in-select-in-subset-tt872217.html – jthetzel Mar 25 '12 at 13:14
  • 10
    You might also have a look at Circles 8.2.31 and 8.2.32 of 'The R Inferno' http://www.burns-stat.com/pages/Tutor/R_inferno.pdf – Patrick Burns Mar 25 '12 at 18:25
  • 13
    Try data.table instead, the default syntax is like airquality[Month == 8 & Temp > 90,] - very readable, and much faster. – Stian Håklev Sep 27 '13 at 20:23
  • 3
    OK. so if subset is bad to use - what about [ vs. dplyr::filter() ? – userJT Feb 12 '15 at 09:55
  • 3
    @RichieCotton, I know your section about dplyr and data.table is full of good intentions but I'm not sure about the medium (a disclaimer) and some of its content. `subset` and `[` are base functions and hence still very much relevant, while `dplyr` and `data.table` remain third-party packages. For somebody writing professional code (e.g. a package), I would recommend using the base `[` over third-party packages so as to avoid dependencies as much as possible. Other people have suggested `dplyr::filter` and `data.table.[` in the comments, and I feel that is their right place IMHO. – flodel Mar 24 '15 at 09:44
  • @flodel I do think it's worth mentioning that the problems with `subset` have been worked around elsewhere, and there are so many comments on this page that it's worth mentioning this either in the question or in the top answer where it is easy to find. That said, it's your (very good) question and you should edit or rollback as you see fit. – Richie Cotton Mar 24 '15 at 10:22
  • 2
    Totally agree with you that they are good works worth mentioning. The problem I have is that while suggesting alternatives to `subset`, you are omitting to mention that there is nothing wrong with the base `[` function. Which remains the reference, the tool beginners (for which installing and learning dplyr should not be a priority) or advanced programmers (caring about not adding dependencies) should be using 99% of the time. So I find the disclaimer a bit misleading. I'll leave a chance for you or other advanced users to give their opinion before I choose to rollback (or not.) – flodel Mar 24 '15 at 11:40
  • 6
    For those wondering, `dplyr::filter` has the same problem. I.e. if the environment happens to have a variable with that name, it will use it instead of the variable in the data frame. Makes for confusing debugging! – CoderGuy123 Jan 28 '17 at 04:44
  • `.subset2` even faster if appropriate. See Hadley on performance in adv-r – Jack Wasey Jun 24 '17 at 11:06

2 Answers

264

This question was answered well in the comments by @James, pointing to an excellent explanation by Hadley Wickham of the dangers of subset (and functions like it) [here]. Go read it!

It's a somewhat long read, so it may be helpful to record here the example Hadley uses that most directly addresses the question of "what can go wrong?".

Suppose we want to subset and then reorder a data frame using the following functions:

scramble <- function(x) x[sample(nrow(x)), ]

subscramble <- function(x, condition) {
  scramble(subset(x, condition))
}

subscramble(mtcars, cyl == 4)

This returns the error:

Error in eval(expr, envir, enclos) : object 'cyl' not found

because R no longer "knows" where to find the object called 'cyl'. He also points out the truly bizarre stuff that can happen if by chance there is an object called 'cyl' in the global environment:

cyl <- 4
subscramble(mtcars, cyl == 4)

cyl <- sample(10, 100, replace = TRUE)
subscramble(mtcars, cyl == 4)

(Run them and see for yourself, it's pretty crazy.)
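For anyone who would rather not run them blind, here is a hedged sketch of what I would expect in a clean session, together with one possible programming-safe alternative. The helper `subscramble_lgl` and the variables `msg`, `all_rows` and `four_cyl` are my own illustrative names, not part of Hadley's text, and the error case assumes that no object called `cyl` is visible outside the data frame when the first call runs.

```r
scramble <- function(x) x[sample(nrow(x)), ]
subscramble <- function(x, condition) scramble(subset(x, condition))

# Case 1: `condition` is a promise for `cyl == 4`, forced in the caller's
# environment, where (by assumption) no `cyl` exists, so the call fails.
msg <- tryCatch(
  { subscramble(mtcars, cyl == 4); NA_character_ },
  error = function(e) conditionMessage(e)
)

# Case 2: a scalar `cyl` outside the data frame. The condition becomes
# `4 == 4`, i.e. a single TRUE, so every row is silently kept.
cyl <- 4
all_rows <- subscramble(mtcars, cyl == 4)
nrow(all_rows) == nrow(mtcars)

# Hypothetical safer variant: the caller passes an already evaluated
# logical vector, and `[` does the filtering with standard evaluation.
subscramble_lgl <- function(x, keep) {
  stopifnot(is.logical(keep), length(keep) == nrow(x))
  scramble(x[!is.na(keep) & keep, , drop = FALSE])
}
four_cyl <- subscramble_lgl(mtcars, mtcars$cyl == 4)
```

The safer variant costs a few keystrokes (`mtcars$cyl` instead of `cyl`), but the result no longer depends on what happens to be defined in the calling environment.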

joran
  • 3
    May I ask some newbie questions for clarification? When we write `subset(mtcars, cyl == 4)` (at top level), where does R look for cyl? If it looks into the `mtcars` object that is passed to `subset()`, then shouldn't it be able to find `cyl` even if `subset` is within another function, since `mtcars` is still being passed to it? If my question doesn't make sense, you could just elaborate more on why R can no longer find `cyl`. Thanks! – Heisenberg Oct 28 '13 at 22:12
  • 4
    @Anh Inside `subset.data.frame`, the thing we're trying to evaluate at that point is just `condition`. That doesn't exist in `mtcars`. So `subset.data.frame` uses `enclos = parent.frame()` to ensure that `condition` is correctly evaluated as `cyl == 4`. But then we've popped back up to the enclosing frame, and now when R looks for `cyl` it is no longer looking inside of `mtcars`. If we didn't use `enclos`, something like `subset(mtcars,cyl == a)` wouldn't work at all. – joran Oct 28 '13 at 22:28
  • does anyone know why subset() wouldn't just implement the faster and safer [,] method behind the scenes? – 3pitt Oct 02 '17 at 20:35
  • 1
    @MikePalmice It does. The last line of `subset.data.frame` is `x[r, vars, drop = drop]`. The problem is how to get from the unquoted `subset` and `select` arguments to something that you can validly pass to `[.data.frame`. – joran Oct 02 '17 at 21:33
  • @joran got it, thanks. how do you think about whether to use dplyr's filter instead of `[]`? – 3pitt Oct 20 '17 at 14:17
  • 1
    This is such an old question/answer with so many upvotes - so clearly I am overlooking something?? For me, your example code doesn't work on its own. Hadley's example contains the pre-creation of another function called 'subset2'... The important difference between `[` and `subset()` lies then within this function... – tjebo Jun 20 '18 at 16:55
  • @Tjebo The example code in my answer works exactly as I describe for me in a clean R (3.4.3) session, as of 5 minutes ago. – joran Jun 20 '18 at 19:43
  • 1
    Thanks for checking. I might have misunderstood the intention of your code. But replacing it with subset using `[`, this results in the same 'weird' result as your code using `subset` - at least here :/ Also clean R 3.4.3 – tjebo Jun 20 '18 at 19:59
35

Also [ is faster:

library(microbenchmark)
microbenchmark(subset(airquality, Month == 8 & Temp > 90),
               airquality[airquality$Month == 8 & airquality$Temp > 90, ])
    Unit: microseconds
                                                           expr     min       lq   median       uq     max neval
                     subset(airquality, Month == 8 & Temp > 90) 301.994 312.1565 317.3600 349.4170 500.903   100
     airquality[airquality$Month == 8 & airquality$Temp > 90, ] 234.807 239.3125 244.2715 271.7885 340.058   100
bartektartanus
  • 41
    Yes and no. I think the time difference you are seeing is due to two things. 1) a small (< 100 microseconds) overhead and 2) `subset` unlike `[` removes rows where the filter evaluates to `NA`. Do this and you'll see that they are both as fast when compared "fairly": `x <- do.call(rbind, rep(list(airquality), 100)); microbenchmark(subset(x, Month == 8 & Temp > 90),{ i <- x$Month == 8 & x$Temp > 90; x[!is.na(i) & i ,] })` – flodel Apr 05 '14 at 16:20
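
The `NA` difference mentioned in the comment above can be seen with base R alone. This is a small sketch on `airquality$Ozone`, which contains missing values; the variable names are illustrative only.

```r
# The condition is NA wherever Ozone is missing.
cond <- airquality$Ozone > 100

via_subset  <- subset(airquality, Ozone > 100)  # drops rows where cond is NA
via_bracket <- airquality[cond, ]               # keeps one all-NA row per NA

# Extra rows produced by `[` correspond to the NA entries in the condition.
nrow(via_bracket) - nrow(via_subset) == sum(is.na(cond))

# Making `[` treat NA as FALSE gives the same result as subset().
via_bracket_fair <- airquality[!is.na(cond) & cond, ]
all.equal(via_subset, via_bracket_fair)
```

So any speed comparison between the two should use the `!is.na(cond) & cond` form, otherwise the two expressions are not doing the same work.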