I am trying to write a function that calculates the mean of a column in a subsetted data frame. The trick is that I always want to apply a couple of fixed subsetting conditions, and then have the option of passing more conditions to the function to further subset the data frame.

Suppose my data look like this:

dat <- data.frame(var1 = rep(letters, 26), var2 = rep(letters, each = 26), var3 = runif(26^2))

head(dat)
  var1 var2      var3
1    a    a 0.7506109
2    b    a 0.7763748
3    c    a 0.6014976
4    d    a 0.6229010
5    e    a 0.5648263
6    f    a 0.5184999

I want to be able to do the subset shown below, using the first condition in all function calls, while the second condition can change with each call. Additionally, the second subsetting condition could be on other variables (I'm using a single variable, var2, for parsimony, but the condition could involve multiple variables).

subset(dat, var1 %in% c('a', 'b', 'c') & var2 %in% c('a', 'b'))
   var1 var2      var3
1     a    a 0.7506109
2     b    a 0.7763748
3     c    a 0.6014976
27    a    b 0.7322357
28    b    b 0.4593551
29    c    b 0.2951004

My example function and function call would look something like:

getMean <- function(expr) {  
  return(with(subset(dat, var1 %in% c('a', 'b', 'c') eval(expr)), mean(var3)))  
}
getMean(expression(& var2 %in% c('a', 'b')))

An alternative call could look like:

getMean(expression(& var4 < 6 & var5 > 10))

Any help is much appreciated.


EDIT: With Wojciech Sobala's help, I came up with the following function, which gives me the option of passing in 0 or more conditions.

getMean <- function(expr = NULL) {
  sub <- if(is.null(expr)) { expression(var1 %in% c('a', 'b', 'c'))
  } else expression(var1 %in% c('a', 'b', 'c') & eval(expr))
  return(with(subset(dat, eval(sub)), mean(var3)))
}
getMean()
getMean(expression(var2 %in% c('a', 'b')))
Erik Shilts
  • You should make a small change (add &) in the subset call: subset(dat, var1 %in% c('a', 'b', 'c') & eval(expr)), and then call getMean(expression(var2 %in% c('a', 'b'))). – Wojciech Sobala Apr 03 '11 at 20:01
  • Great, that works. Do you want to make your response an Answer so I can accept it? – Erik Shilts Apr 04 '11 at 14:11
  • If you're going to work with expressions, I think you're better off replicating the logic of subset in your own function - your current attempt is going to create bugs that are very hard to fix. – hadley Apr 04 '11 at 14:26
  • Thanks, Hadley. It's not obvious to me why that would be. Can you point me towards something that would explain why or how to go about doing that? – Erik Shilts Apr 04 '11 at 14:35
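
One possible reading of hadley's suggestion, offered as a sketch rather than as his actual recommendation: capture the caller's condition unevaluated with substitute() and evaluate it with the data frame's columns in scope, which is roughly what subset() does internally. The name getMean2 and the data argument are assumptions made for this illustration and do not appear in the thread.

```r
# Hypothetical helper, not from the original thread.
getMean2 <- function(data, cond = TRUE) {
  # Capture the condition as an unevaluated call (or the constant TRUE).
  cond_call <- substitute(cond)
  # Evaluate it with the columns of `data` in scope; any other names
  # are looked up in the caller's environment.
  extra <- eval(cond_call, data, parent.frame())
  # Fixed condition that applies to every call.
  keep <- data$var1 %in% c("a", "b", "c") & extra
  # Treat NA in the condition as "drop the row", as subset() does.
  keep <- keep & !is.na(keep)
  mean(data$var3[keep])
}

set.seed(1)
dat <- data.frame(var1 = rep(letters, 26),
                  var2 = rep(letters, each = 26),
                  var3 = runif(26^2))

getMean2(dat)                          # fixed condition only
getMean2(dat, var2 %in% c("a", "b"))   # plus one extra condition
```

With this approach the caller writes the condition directly instead of wrapping it in expression(), and the data frame is passed explicitly rather than found as a global.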

2 Answers

This is how I would approach it. The function getMean makes use of R's handy default argument values:

getMean <- function(x, subset_var1, subset_var2=unique(x$var2)){
    xs <- subset(x, x$var1 %in% subset_var1 & x$var2 %in% subset_var2)

    mean(xs$var3)
}

getMean(dat, c('a', 'b', 'c'))
[1] 0.4762141

getMean(dat, c('a', 'b', 'c'), c('a', 'b'))
[1] 0.3814149
Andrie
  • Thanks for the answer. I wasn't clear enough in my initial post, so I edited it to say that the subsetting may be done on multiple variables in various ways. I'm not sure how I'm going to need to subset the data, so the function needs to be flexible enough to handle other ways besides `%in%`. – Erik Shilts Apr 03 '11 at 18:01

It can be simplified by giving expr a default value of TRUE.

getMean <- function(expr = TRUE) {
  return(with(subset(dat, var1 %in% c('a', 'b', 'c') & eval(expr)), mean(var3)))
}
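
The default works because eval() of a constant returns the constant, and `&` recycles a length-one TRUE across the whole logical vector, so the extra condition becomes a no-op. A quick check on a made-up vector (base_cond is purely illustrative and not part of the answer):

```r
# Illustrative stand-in for the result of var1 %in% c('a', 'b', 'c').
base_cond <- c(TRUE, FALSE, TRUE, NA)

# eval() of a constant returns the constant itself.
default_value <- eval(TRUE)

# TRUE is recycled, so the combined condition equals the original one,
# including the NA element.
unchanged <- identical(base_cond & default_value, base_cond)
print(unchanged)
```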
Wojciech Sobala