
When working with data frames, it is common to need a subset. However, use of the subset function is discouraged. The trouble with the following code is that the data frame name is mentioned twice. If you copy, paste and adapt code, it is easy to forget to change the second mention of adf, which can be a disaster.

adf=data.frame(a=1:10,b=11:20)
print(adf[which(adf$a>5),])  ##alas, adf mentioned twice
print(with(adf,adf[{a>5},])) ##alas, adf mentioned twice
print(subset(adf,a>5)) ##alas, not supposed to use subset

Is there a way to write the above without mentioning adf twice? Unfortunately, with with() or within(), I cannot seem to access adf as a whole.

The subset(...) function would make this easy, but its documentation warns against using it:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
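To make the warning concrete, here is a minimal sketch of one such unanticipated consequence. It assumes a variable in the calling scope (here `b`, chosen for illustration) that happens to share its name with a column:

```r
adf <- data.frame(a = 1:10, b = 11:20)

# A variable in the calling scope that shares its name with a column.
b <- 5

# subset() evaluates the condition inside the data frame first, so `b`
# resolves to the column adf$b (11..20), not to the variable b = 5.
nse_result <- subset(adf, a > b)

# Standard subsetting evaluates `b` in the calling scope, so it is 5.
std_result <- adf[adf$a > b, ]

nrow(nse_result)  # 0 rows: no a exceeds the b in its own row
nrow(std_result)  # 5 rows: a = 6..10
```

Both calls run without any warning, which is what makes this kind of name clash hard to spot in non-interactive code.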

Chris
  • Using `filter` from `dplyr`, i.e. `filter(adf, a > 5)`, is similar to `subset`. If you are using `data.table`: `setDT(adf)[a > 5]`. – akrun May 03 '15 at 15:50
  • I'm with @akrun here and stopped using `data.frames` long ago. Once you convert your data set to a `data.table`, all your syntax will become much shorter. Though I just want to mention that you are using way too much code here. You need neither `print` nor `which`; just `adf[adf$a>5,]` will do, which in turn doesn't look too confusing to me. – David Arenburg May 03 '15 at 15:59
  • If you want to know why the use of `subset()` is not encouraged, please have a look at [this SO question](http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset). – MERose May 03 '15 at 16:03
  • writing the name of the data set twice should be the *least* of your worries. and you should not be copy/pasting *anything* – rawr May 03 '15 at 16:31
  • akrun, I like your suggestion of filter. – Chris May 03 '15 at 19:21

2 Answers


As @akrun states, I would use dplyr's filter function:

library(dplyr)
new <- filter(adf, a > 5)
new

In practice, I don't find the subsetting notation ([ ]) problematic: if I copy a block of code, I use find and replace within RStudio to replace all mentions of the data frame in the selected code. I use dplyr instead because its notation and syntax are easier to follow for new users (and myself!), and because the dplyr functions 'do one thing well.'

Phil

After some thought, I wrote a super simple function called given:

given=function(.,...) { with(.,...) }

This way, I don't have to repeat the name of the data.frame. In the benchmark below, on this 10-row example, it was also roughly 13 to 14 times faster than filter() (median about 52 µs versus about 675 µs). See below:

library(microbenchmark)
adf=data.frame(a=1:10,b=11:20)
given=function(.,...) { with(.,...) }
with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)

Output, including the microbenchmark timings:

> adf=data.frame(a=1:10,b=11:20)
> given=function(.,...) { with(.,...) }
> with(adf,adf[a>5 & b<18,]) ##adf mentioned twice :(
  a  b
6 6 16
7 7 17
> given(adf,.[a>5 & b<18,]) ##adf mentioned once :)
  a  b
6 6 16
7 7 17
> dplyr::filter(adf,a>5,b<18) ##adf mentioned once...
  a  b
1 6 16
2 7 17
> microbenchmark(with(adf,adf[a>5 & b<18,]),times=1000)
Unit: microseconds
                             expr    min     lq     mean median     uq     max neval
 with(adf, adf[a > 5 & b < 18, ]) 47.897 60.441 67.59776 67.284 70.705 361.507  1000
> microbenchmark(given(adf,.[a>5 & b<18,]),times=1000)
Unit: microseconds
                            expr    min     lq     mean median    uq     max neval
 given(adf, .[a > 5 & b < 18, ]) 48.277 50.558 54.26993 51.698 56.64 272.556  1000
> microbenchmark(dplyr::filter(adf,a>5,b<18),times=1000)
Unit: microseconds
                              expr     min       lq     mean   median       uq      max neval
 dplyr::filter(adf, a > 5, b < 18) 524.965 581.2245 748.1818 674.7375 889.7025 7341.521  1000

In this run, given() even measured a tad faster than with(). Since given() simply calls with(), that gap is more likely measurement noise than an effect of the variable name's length.

The neat thing about given is that you can do some things inline without assignment: given(data.frame(a=1:10,b=11:20),.[a>5 & b<18,])
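One thing to watch for: because given() forwards to with(), the expression appears to be evaluated with given()'s own frame as the enclosure, so variables local to a function that calls given() would likely not be visible. The following is only a sketch of a possible workaround, not part of the answer above; `given2` and `filter_above` are hypothetical names introduced for illustration.

```r
adf <- data.frame(a = 1:10, b = 11:20)

# The helper from the answer, for comparison at top level.
given <- function(., ...) with(., ...)

# Hypothetical variant: capture the expression unevaluated, then evaluate it
# with the data frame's columns in scope, falling back to an environment
# that binds `.` and inherits from the caller's frame.
given2 <- function(., expr) {
  expr <- substitute(expr)
  enclos <- new.env(parent = parent.frame())
  assign(".", ., envir = enclos)
  eval(expr, ., enclos)
}

# `threshold` exists only inside this function; given2() can still see it.
filter_above <- function(df, threshold) {
  given2(df, .[a > threshold, ])
}

res_top <- given(adf, .[a > 5 & b < 18, ])    # top-level use, as in the answer
res_top2 <- given2(adf, .[a > 5 & b < 18, ])  # same call with the variant
res_fun <- filter_above(adf, 5)               # use from inside a function
```

At top level the two helpers should behave the same; the variant only matters once the call is wrapped in another function.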

Cristik
Chris