Passing a variable for a column name?

Question

For example, suppose you had a function that applied some dplyr functions, but you couldn't expect the datasets passed to it to have the same column names.

For a simplified example of what I mean, say you had a data frame, arizona.trees:

arizona.trees
group arizona.redwoods   arizona.oaks 
A     23                 11        
A     24                 12  
B     9                  8 
B     10                 7
C     88                 22

and another very similar data frame, california.trees:

california.trees
group    california.redwoods california.oaks 
A        25                  50        
A        11                  33  
B        90                  5 
B        77                  3
C        90                  35

And you wanted to implement a function that returns the mean for the given groups (A, B, ... Z) for a given type of tree that would work for both of these data frames.

foo <- function(dataset, group1, group2, tree.type) { 
     column.name <- colnames(dataset[2])
     result <- filter(dataset, group %in% c(group1, group2)) %>%
               select(group, contains(tree.type)) %>%
               group_by(group) %>%
               summarize("mean" = mean(column.name))
     return(result)
}

A desired output for a call of foo(california.trees, A, B, redwoods) would be:

result
       mean
A       18
B       83.5

For some reason, an implementation like foo() above just doesn't work. This is likely due to how the column is referenced: the function seems to compute the mean of the column.name string rather than retrieving the column and passing it to mean(). I'm not sure how to avoid this. The pipe operator also passes the modified data frame implicitly, so it can't be referenced directly, which may be contributing to the issue.

Why is this? Is there some alternative implementation that would work?
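
The symptom described above can be reproduced without dplyr. The following is a minimal sketch using base R only; the small `trees` data frame is a hypothetical stand-in for california.trees. It illustrates that `column.name` holds a character string, so `mean()` receives the string itself rather than the column it names.

```r
# column.name holds the *name* of the column as a string, not the column's values.
column.name <- "california.redwoods"

# mean() on a character vector cannot compute anything: it warns and returns NA.
res <- suppressWarnings(mean(column.name))
is.na(res)

# Indexing the data frame by that string retrieves the actual column values.
trees <- data.frame(group = c("A", "A"), california.redwoods = c(25, 11))
col_mean <- mean(trees[[column.name]])
col_mean
```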

You have to read the "non-standard evaluation" (nse) vignette for dplyr (https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html) and note that this approach is due to change in the next dplyr release (http://dplyr.tidyverse.org/articles/programming.html) — talat, Apr 26 '17 at 18:29

akrun · Accepted Answer · 2017-04-26T19:08:06.557

We can use the quosure-based solution from the devel version of dplyr (soon to be released as 0.6.0)

foo <- function(dataset, group1, group2, tree.type){
        group1 <- quo_name(enquo(group1))
        group2 <- quo_name(enquo(group2))
        colN <- rlang::parse_quosure(names(dataset)[2])
        tree.type <- quo_name(enquo(tree.type))
        dataset %>%
                filter(group %in% c(group1, group2)) %>%
                select(group, contains(tree.type)) %>%
                group_by(group) %>%
                summarise(mean = mean(UQ(colN)))
}


foo(california.trees, A, B, redwoods)
# A tibble: 2 × 2
#  group  mean
#  <chr> <dbl>
#1     A  18.0
#2     B  83.5

foo(arizona.trees, A, B, redwoods)
# A tibble: 2 × 2
#   group  mean
#  <chr> <dbl>
#1     A  23.5
#2     B   9.5

enquo takes each input argument and converts it to a quosure; quo_name then converts the quosure to a string so that it can be used with %in%. The second column name is converted from a string to a quosure using parse_quosure, and it is then unquoted (UQ or !!) so that it is evaluated within summarise.
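
The capture-and-unquote steps can be looked at in isolation. This is only a sketch under stated assumptions: it uses a released dplyr together with rlang, and it swaps in `sym()` to turn the column-name string into a symbol rather than `parse_quosure()`, which was renamed in later rlang versions. The helper names `capture_name` and `mean_of`, and the small `trees` data frame, are hypothetical and exist only for illustration.

```r
library(dplyr)
library(rlang)

# enquo() captures the bare argument as a quosure; quo_name() turns it into a string.
capture_name <- function(x) quo_name(enquo(x))
captured <- capture_name(redwoods)

# A column name held as a string is converted to a symbol with sym(), then
# unquoted with !! so that summarise() evaluates it as a column of the data.
mean_of <- function(dataset, col_string) {
  col_sym <- sym(col_string)
  dataset %>%
    group_by(group) %>%
    summarise(mean = mean(!!col_sym))
}

# Hypothetical subset of california.trees, enough to exercise the helper.
trees <- data.frame(group = c("A", "A", "B", "B"),
                    california.redwoods = c(25, 11, 90, 77))
out <- mean_of(trees, names(trees)[2])
out
```

Without the `!!`, `mean()` would receive the symbol object itself instead of the column values, which is the same kind of failure as in the question.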

NOTE: The above solution selects the column by position (the second column, as in the OP's code), so it may not work for other columns. Instead, we can match 'tree.type' against the column names and get the 'mean' of the matching column

foo1 <- function(dataset, group1, group2, tree.type){
        group1 <- quo_name(enquo(group1))
        group2 <- quo_name(enquo(group2))
        tree.type <- quo_name(enquo(tree.type))
        dataset %>%
                filter(group %in% c(group1, group2)) %>%
                select(group, contains(tree.type)) %>%
                group_by(group) %>%
                summarise_at(vars(contains(tree.type)), funs(mean = mean(.)))
}

The function can be tested for different columns in the two datasets

foo1(arizona.trees, A, B, oaks)
# A tibble: 2 × 2
#  group  mean
#   <chr> <dbl>
#1     A  11.5
#2     B   7.5

foo1(arizona.trees, A, B, redwood)
# A tibble: 2 × 2
#  group  mean
#   <chr> <dbl>
#1     A  23.5
#2     B   9.5

foo1(california.trees, A, B, redwood)
# A tibble: 2 × 2
#  group  mean
#   <chr> <dbl>
#1     A  18.0
#2     B  83.5

foo1(california.trees, A, B, oaks)
# A tibble: 2 × 2
#  group  mean
#  <chr> <dbl>
#1     A  41.5
#2     B   4.0
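
For readers on a later dplyr (1.0.0 or newer), where `funs()` has been deprecated, the same idea can be sketched with `across()`. This is an illustrative variant under that assumption, not part of the original answer: the name `foo2` is hypothetical, `rlang::as_name()` stands in for `quo_name()`, and the result column is named `mean_<column>` rather than `mean`. The data frame is repeated here so the block runs on its own.

```r
library(dplyr)

foo2 <- function(dataset, group1, group2, tree.type) {
  # Capture the bare arguments and convert them to strings.
  keep <- c(rlang::as_name(rlang::enquo(group1)),
            rlang::as_name(rlang::enquo(group2)))
  tree.type <- rlang::as_name(rlang::enquo(tree.type))
  dataset %>%
    filter(group %in% keep) %>%
    group_by(group) %>%
    # across() applies mean to every non-grouping column whose name contains tree.type.
    summarise(across(contains(tree.type), mean, .names = "mean_{.col}"))
}

california.trees <- data.frame(
  group = c("A", "A", "B", "B", "C"),
  california.redwoods = c(25L, 11L, 90L, 77L, 90L),
  california.oaks = c(50L, 33L, 5L, 3L, 35L)
)

out2 <- foo2(california.trees, A, B, redwoods)
out2
```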

data

arizona.trees <- structure(list(group = c("A", "A", "B", "B", "C"), 
arizona.redwoods = c(23L, 
24L, 9L, 10L, 88L), arizona.oaks = c(11L, 12L, 8L, 7L, 22L)),
.Names = c("group", 
"arizona.redwoods", "arizona.oaks"), class = "data.frame",
 row.names = c(NA, -5L))

california.trees <- structure(list(group = c("A", "A", "B", "B", "C"), 
 california.redwoods = c(25L, 
11L, 90L, 77L, 90L), california.oaks = c(50L, 33L, 5L, 3L, 35L
)), .Names = c("group", "california.redwoods", "california.oaks"
), class = "data.frame", row.names = c(NA, -5L))

Thanks for the quick reply! Is there no way to do something like this without using the devel version of `dplyr`? — user3450277, Apr 26 '17 at 18:48
@user3450277 The devel version will soon be released as 0.6.0. It was expected in April; it may reach CRAN in May. Otherwise, you can use the `summarise_` etc. functions, but they will soon be deprecated. If you are writing production-level code, it is better to code for the long term — akrun, Apr 26 '17 at 18:53
And in case you've not worked with a development version before, installing it is easy (as is rolling back to the current release): `install.packages("devtools") # if you haven't` and then `devtools::install_github("tidyverse/dplyr")` Rolling back is just a normal `install.packages("dplyr")` call. — William Doane, Apr 27 '17 at 14:01
The Tidyeval features in the dev version are key to your question: https://github.com/tidyverse/dplyr/blob/master/NEWS.md#tidyeval — William Doane, Apr 27 '17 at 14:35
