Use dplyr::filter() within function

Question

I'm just beginning to learn how to write my own functions, and I'm trying to write a compute_means function for a very specific kind of data frame. This question seems similar, but it didn't get an answer and I haven't found anything else that seems to address it.

My data looks something like this:

student <- c("alw", "alw", "bef", "bef")
semester <- c("autumn", "spring", "autumn", "spring" )
test1 <- c(87, 88, 90, 78)
test2 <- c(67, 78, 81, 88)

x <- data.frame(student, semester, test1, test2)

What I would like to do is write a function that computes the means grouped by semester, grouped by student and semester, or for just a single student. I can get the grouped versions to work, but I'm stuck on computing the mean test scores for a single student. Here is what I have so far (the problematic section is the else if part):

compute_means <- function(df, student = NA, separate = FALSE){
    if (!separate & is.na(student)){
       df %>%
        group_by(semester) %>%
        summarise(count = n(), test1 = mean(test1), test2 = mean(test2)) %>%
        mutate(students = c("AllStudnts")) %>%
        select(students, semester: test2)  
    }
else if(!separate & !is.na(student)){
    df %>%
        filter(student == student) %>%
        group_by(semester) %>%
        summarise(count = n(), test1 = mean(test1), test2 = mean(test2)) %>%
        mutate(student = student)

    }
else{
    df %>%
        group_by(student, semester) %>%
        summarise(count = n(), test = mean(test1), test2 = mean(test2))     
    }
}

compute_means(x) does what I expect: I get the mean for all students by semester. compute_means(x, separate = TRUE) also does what I expect. However, compute_means(x, student = "alw") doesn't. Instead of getting only alw, I get the same result as if filter() weren't there. I imagine there is an easy way to do this, but I can't figure out what it is.

When using `dplyr` in functions you [need to use the standard evaluation versions of the dplyr functions (just append `_` to the function names, ie. `filter_`)](http://stackoverflow.com/a/27975126/4002530) — tospig, Jan 17 '16 at 21:56
I'm not sure if that worked for me. It looks like it returns the same value as before. — JoeF, Jan 17 '16 at 21:59
If I use `mutate_()`, I get `Error: binding not found: 'alw'` . — JoeF, Jan 17 '16 at 22:12
What version of `dplyr` are you using? `packageVersion('dplyr')` tells you. Everything runs fine for me with version ‘0.4.3’ — Gopala, Jan 17 '16 at 22:26
Strange. I'm running the same ‘0.4.3’. When you say everything runs fine, is that with my original code or with the `filter_()` addition? — JoeF, Jan 17 '16 at 22:27
@JoeF Question, why is there a trailing `mutate(student = student)` in the 1st `else if` ? It doesn't do anything as far as I can tell. — steveb, Jan 18 '16 at 04:18
I think I found what will fix this but I am not sure why it is an issue (perhaps someone else can explain). In the function parameters, change `student=NA` to something like `student_name=NA`. You will also have to change `student` in a number of locations in the function. — steveb, Jan 18 '16 at 04:28
Breaking it down a little further, creating a function like this `filter_student <- function(df, student = NA, separate = FALSE) { df %>% filter(student == student) }` and calling like this `filter_student(x, 'alw')` will return the entire data frame, not filtered as you might think. Like mentioned in my previous comment, changing the parameter `student` to something else fixes the issue. — steveb, Jan 18 '16 at 04:40
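
The behaviour described in the last comment can be reproduced with a short sketch. It assumes dplyr is installed and reuses the example data from the question; filter_same_name and filter_other_name are hypothetical names chosen only for illustration.

```r
library(dplyr)

# Example data from the question.
x <- data.frame(
  student  = c("alw", "alw", "bef", "bef"),
  semester = c("autumn", "spring", "autumn", "spring"),
  test1    = c(87, 88, 90, 78),
  test2    = c(67, 78, 81, 88)
)

# Argument named like the column: both sides of == resolve to the column,
# so the condition is TRUE for every row.
filter_same_name <- function(df, student = NA) {
  df %>% filter(student == student)
}

# Argument with a different (illustrative) name: the right-hand side is not
# a column, so it resolves to the function argument.
filter_other_name <- function(df, student_name = NA) {
  df %>% filter(student == student_name)
}

nrow(filter_same_name(x, "alw"))   # all 4 rows come back
nrow(filter_other_name(x, "alw"))  # only the 2 "alw" rows
```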

steveb · Accepted Answer · 2016-01-18T06:14:02.227

Below is a modified version of your function that should give you what you expect. I changed the parameter student to student_name. I also removed the trailing mutate(student = student) as it looks like it is not needed, and I added a pipe to ungroup to remove remaining groupings as they are likely not needed.

library(dplyr)

compute_means <- function(df, student_name = NA, separate = FALSE){
    if (!separate & is.na(student_name)){
        # Means per semester across all students.
        df %>%
            group_by(semester) %>%
            summarise(count = n(), test1 = mean(test1), test2 = mean(test2)) %>%
            mutate(students = c("AllStudnts")) %>%
            select(students, semester: test2)
    }
    else if(!separate & !is.na(student_name)){
        # Means per semester for a single student.
        df %>%
            filter(student == student_name) %>%
            group_by(semester) %>%
            summarise(count = n(), test1 = mean(test1), test2 = mean(test2))
    }
    else{
        # Means per student and semester.
        df %>%
            group_by(student, semester) %>%
            summarise(count = n(), test = mean(test1), test2 = mean(test2)) %>%
            ungroup() # added since you don't need the remaining grouping.
    }
}

Starting with the input x:

> x
  student semester test1 test2
1     alw   autumn    87    67
2     alw   spring    88    78
3     bef   autumn    90    81
4     bef   spring    78    88

Here is the output from various calls to compute_means:

> compute_means(x)
Source: local data frame [2 x 5]

    students semester count test1 test2
       (chr)   (fctr) (int) (dbl) (dbl)
1 AllStudnts   autumn     2  88.5    74
2 AllStudnts   spring     2  83.0    83
> compute_means(x, separate = TRUE)
Source: local data frame [4 x 5]
Groups: student [?]

  student semester count  test test2
   (fctr)   (fctr) (int) (dbl) (dbl)
1     alw   autumn     1    87    67
2     alw   spring     1    88    78
3     bef   autumn     1    90    81
4     bef   spring     1    78    88
> compute_means(x, student_name = 'alw')
Source: local data frame [2 x 4]

  semester count test1 test2
    (fctr) (int) (dbl) (dbl)
1   autumn     1    87    67
2   spring     1    88    78
> compute_means(x, student_name = 'bef')
Source: local data frame [2 x 4]

  semester count test1 test2
    (fctr) (int) (dbl) (dbl)
1   autumn     1    90    81
2   spring     1    78    88

EDIT

What happens with filter(student == student) (the OP's code) is that, inside filter, the name student refers to the student column of df on both sides of the ==, not to the function parameter. The comparison is therefore the column against itself, which is TRUE for every row, so nothing is filtered out.
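
If renaming the argument is not an option, one possible alternative is to tell dplyr explicitly which student is meant. The sketch below is not part of the original answer. It assumes a dplyr release recent enough to provide the .env pronoun (1.0 or later, as far as I know; it is not available in the 0.4.3 release used in this thread), and filter_student is a hypothetical helper name.

```r
library(dplyr)

# Example data from the question.
x <- data.frame(
  student  = c("alw", "alw", "bef", "bef"),
  semester = c("autumn", "spring", "autumn", "spring"),
  test1    = c(87, 88, 90, 78),
  test2    = c(67, 78, 81, 88)
)

# The argument keeps the same name as the column. The bare `student` on the
# left is looked up in the data frame first; `.env$student` skips the data
# frame and takes the value from the function's environment instead.
filter_student <- function(df, student) {
  df %>% filter(student == .env$student)
}

filter_student(x, "alw")  # only the rows for student "alw"
```

Renaming the argument, as in the function above, achieves the same thing and also works on older dplyr versions, so it is probably the simpler choice here.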

Thanks. That worked. The reason why I had `mutate(student = student)` in the original was because I wanted an ID variable in the dataset (in case I wanted to join two datasets later, say). Otherwise, I would have means without knowing who they belonged to. It was also interesting to see that `mutate()` could take the argument in the (original) function but not `filter()`. I'm curious why the old version didn't work, but this is good enough. — JoeF, Jan 18 '16 at 05:59
@JoeF What is happening with something like `filter(student == student)` is that in the context of `filter`, the item `student` is a reference to `student` in `df`, on both sides of the `==`, not the function parameter. — steveb, Jan 18 '16 at 06:08
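
On the observation that mutate() appeared to accept the argument while filter() did not: a likely explanation is that summarise() after group_by(semester) keeps only the grouping column and the summaries, so no student column is left by the time mutate() runs, and the name falls through to the function argument. The sketch below illustrates this with the renamed argument and keeps the ID column that was wanted for later joins. means_for is a hypothetical name, and dplyr is assumed to be installed.

```r
library(dplyr)

# Example data from the question.
x <- data.frame(
  student  = c("alw", "alw", "bef", "bef"),
  semester = c("autumn", "spring", "autumn", "spring"),
  test1    = c(87, 88, 90, 78),
  test2    = c(67, 78, 81, 88)
)

means_for <- function(df, student_name) {
  df %>%
    filter(student == student_name) %>%
    group_by(semester) %>%
    summarise(count = n(), test1 = mean(test1), test2 = mean(test2)) %>%
    # summarise() dropped the student column, so add the ID back explicitly.
    mutate(student = student_name)
}

res <- means_for(x, "bef")
names(res)  # semester, count, test1, test2, student
```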
