Creating a function with an argument passed to dplyr::filter: what is the best way to work around NSE?

Question

Non-standard evaluation is really handy when using dplyr's verbs, but it can be problematic when those verbs are used with function arguments. For example, let us say that I want to create a function that gives me the number of rows for a given species.

# Load packages and prepare data
library(dplyr)
library(lazyeval)
# I prefer lowercase column names
names(iris) <- tolower(names(iris))
# Number of rows for all species
nrow(iris)
# [1] 150

Example not working

This function doesn't work as expected because species is interpreted in the context of the iris data frame instead of being interpreted in the context of the function argument:

nrowspecies0 <- function(dtf, species){
    dtf %>%
        filter(species == species) %>%
        nrow()
}
nrowspecies0(iris, species = "versicolor")
# [1] 150

3 examples of implementation

To work around non-standard evaluation, I usually append an underscore to the argument name:

nrowspecies1 <- function(dtf, species_){
    dtf %>%
        filter(species == species_) %>%
        nrow()
}

nrowspecies1(iris, species_ = "versicolor")
# [1] 50
# Because of R's partial argument matching, the argument
# name species works too
nrowspecies1(iris, species = "versicolor")
# [1] 50
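
The second call succeeds because R partially matches argument names: `species` is an unambiguous prefix of `species_`. Below is a minimal sketch of that behaviour. The helper `show_species()` is hypothetical and only serves the illustration; `warnPartialMatchArgs` is a base R option that makes such calls emit a warning.

```r
# Hypothetical helper, only for illustrating partial argument matching
show_species <- function(dtf, species_) {
    species_
}

# Exact argument name
exact <- show_species(NULL, species_ = "versicolor")

# `species` is an unambiguous prefix of `species_`, so R matches it too
partial <- show_species(NULL, species = "versicolor")
identical(exact, partial)
# [1] TRUE

# Ask R to warn whenever a call relies on partial matching,
# and record whether a warning was raised
old_opts <- options(warnPartialMatchArgs = TRUE)
warned <- tryCatch(
    {
        show_species(NULL, species = "versicolor")
        FALSE
    },
    warning = function(w) TRUE
)
options(old_opts)  # restore the previous setting
warned
```

Turning the option on during development is one way to catch calls that depend on partial matching by accident.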

It is not completely satisfactory, since it changes the name of the function argument to something less user-friendly, or else it relies on partial argument matching, which I'm afraid is not good programming practice. To keep a nice argument name, I could do:

nrowspecies2 <- function(dtf, species){
    species_ <- species
    dtf %>%
        filter(species == species_) %>%
        nrow()
}
nrowspecies2(iris, species = "versicolor")
# [1] 50

Another way to work around non-standard evaluation is based on this answer. interp() interprets species in the context of the function environment:

nrowspecies3 <- function(dtf, species){
    dtf %>%
        filter_(interp(~species == with_species, 
                       with_species = species)) %>%
        nrow()
}
nrowspecies3(iris, species = "versicolor")
# [1] 50

Considering the three functions above, what is the preferred, most robust way to implement this filter function? Are there any other ways?

Data frame column name quotation is one of the reasons I am starting to prefer Python. See [Tidyverse style pandas](https://stmorse.github.io/journal/tidyverse-style-pandas.html#parting-thoughts): "_Tidyverse allows a mix of quoted and unquoted references to variable names. In my (in)experience, the convenience this brings is accompanied by equal consternation. It seems to me a lot of the problems solved by tidyeval would not exist if all variables were quoted all the time, as in pandas, but there are likely deeper truths I’m missing here…_" – Paul Rougieux, Aug 20 '19 at 13:54

jaimedash · Accepted Answer · 2016-04-18T14:23:50.060

The answer from @eddi is correct about what's going on here. I'm writing another answer that addresses the larger request of how to write functions using dplyr verbs. You'll note that, ultimately, it uses something like nrowspecies2 to avoid the species == species tautology.

To write a function wrapping dplyr verb(s) that will work with NSE, write two functions:

First write a version that requires quoted inputs, using lazyeval and an SE version of the dplyr verb. So in this case, filter_.

nrowspecies_robust_ <- function(data, species){ 
  species_ <- lazyeval::as.lazy(species) 
  condition <- ~ species == species_ # *
  tmp <- dplyr::filter_(data, condition) # **
  nrow(tmp)
} 
nrowspecies_robust_(iris, ~versicolor)

Second make a version that uses NSE:

nrowspecies_robust <- function(data, species) { 
  species <- lazyeval::lazy(species) 
  nrowspecies_robust_(data, species) 
} 
nrowspecies_robust(iris, versicolor)

* = if you want to do something more complex, you may need to use lazyeval::interp here as in the tips linked below

** = also, if you need to change output names, see the .dots argument

For the above, I followed some tips from Hadley
Another good resource is the dplyr vignette on NSE, which illustrates .dots, interp, and other functions from the lazyeval package
For even more details on lazyeval, see its vignette
For a thorough discussion of the base R tools for working with NSE (many of which lazyeval helps you avoid), see the chapter on NSE in Advanced R
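
For readers on a recent dplyr, where the underscore verbs such as `filter_` are deprecated, the same two-function pattern might be sketched with the `.data` and `.env` pronouns and `rlang::ensym()`. This is only a sketch, not the approach this answer was written around. The names `nrowspecies_tidy_`, `nrowspecies_tidy` and `iris_lc` are hypothetical, and here the quoted version takes a plain character string rather than a formula.

```r
library(dplyr)

# Local copy with lowercase column names, as in the question
iris_lc <- setNames(iris, tolower(names(iris)))

# Quoted (SE) version: species is an ordinary character string
nrowspecies_tidy_ <- function(data, species) {
  data %>%
    # .data$species is the column, .env$species is the function argument
    filter(.data$species == .env$species) %>%
    nrow()
}

# Unquoted (NSE) version: capture the bare name, then delegate
nrowspecies_tidy <- function(data, species) {
  species_chr <- rlang::as_name(rlang::ensym(species))
  nrowspecies_tidy_(data, species_chr)
}

nrowspecies_tidy_(iris_lc, "versicolor")
# [1] 50
nrowspecies_tidy(iris_lc, versicolor)
# [1] 50
```

The pronouns make the column/argument distinction explicit, so the argument can keep the name `species` without the `species == species` tautology.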

Thanks, the email from Hadley you mentioned made me look at `vignette("lazyeval")` which explains that "Every function that uses NSE should have a standard evaluation (SE) escape hatch that does the actual computation. The SE-function name should end with _." I would like an explanation of what Hadley means by "suitable for programming with" at the end of the `lazyeval` vignette. Does that imply that I should not use nse inside functions? — Paul Rougieux, Apr 18 '16 at 13:55
Yes, or at least you should avoid it when possible. Also check out section "Downsides of non-standard evaluation" here http://adv-r.had.co.nz/Computing-on-the-language.html The basic issue, as Hadley explains in that chapter, is NSE is very hard to reason about within a program because functions may _act differently in different contexts_. That is, when used interactively, an NSE function may act differently than when used in a function. — jaimedash, Apr 18 '16 at 14:20
Hadley explains the concept of "referential transparency" in [his keynote at the 2016 UseR conference](https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/Towards-a-grammar-of-interactive-graphics) (at 38min30s). "formula keep both the code and the environment in which this code should be evaluated, without actually doing the evaluation." I created an example using a formula and pasted it in a new answer. — Paul Rougieux, Aug 11 '16 at 13:48

eddi · Answer 2 · 2016-04-15T15:39:46.040

This question has absolutely nothing to do with non standard evaluation. Let me rewrite your initial function to make that clear:

nrowspecies4 <- function(dtf, boo){
    dtf %>%
        filter(boo == boo) %>%
        nrow()
}
nrowspecies4(iris, boo = "versicolor")
#150

The expression inside your filter always evaluates to TRUE (almost always - see example below), that's why it doesn't work, not because of some NSE magic.

Your nrowspecies2 is the way to go.

Fwiw, species in your nrowspecies0 is indeed evaluated as a column, not as the input variable species, and you can check that by comparing nrowspecies0(iris, NA) to nrowspecies4(iris, NA).
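
That comparison can be run as follows. This is a self-contained sketch that repeats the two functions from above and uses a local lowercase copy of iris, here called `iris_lc` (a name introduced only for this example).

```r
library(dplyr)

# Local copy with lowercase column names, as in the question
iris_lc <- setNames(iris, tolower(names(iris)))

nrowspecies0 <- function(dtf, species){
    dtf %>%
        filter(species == species) %>%
        nrow()
}

nrowspecies4 <- function(dtf, boo){
    dtf %>%
        filter(boo == boo) %>%
        nrow()
}

# species resolves to the column, so the NA argument is never looked at
nrowspecies0(iris_lc, NA)
# [1] 150

# boo is not a column, so the argument is used: NA == NA is NA,
# and filter() drops rows where the condition is NA
nrowspecies4(iris_lc, NA)
# [1] 0
```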

edited Apr 15 '16 at 15:39

answered Apr 15 '16 at 15:26

eddi

49,088
6
104
155

Not sure why, but this didn't work for me. I ended up using `filter_` as suggested in [the answer](https://stackoverflow.com/a/38898434/4190925) below. (E: my function used also `group_by` and passed the result further so maybe this was the reason) – jjj Oct 01 '19 at 09:50

Paul Rougieux · Answer 3 · 2020-05-05T13:56:16.300

In his 2016 UseR talk (at 38min30s), Hadley Wickham explains the concept of referential transparency. Using a formula, the filter function can be reformulated as:

nrowspecies5 <- function(dtf, formula){
    dtf %>%
        filter_(formula) %>%
        nrow()
}

This has the added benefit of being more generic:

# Make column names lower case
names(iris) = tolower(names(iris)) 
nrowspecies5(iris, ~ species == "versicolor")
# 50
nrowspecies5(iris, ~ sepal.length > 6 & species == "virginica")
# 41
nrowspecies5(iris, ~ sepal.length > 6 & species == "setosa")
# 0
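
Since `filter_()` has been deprecated in later dplyr releases, a comparable generic function could be sketched with the embrace operator `{{ }}`, assuming a dplyr and rlang recent enough to support it (rlang 0.4.0 or later). The names `nrowspecies6` and `iris_lc` are hypothetical and used only for this illustration. The condition is passed as a bare expression instead of a formula.

```r
library(dplyr)

# Local copy with lowercase column names
iris_lc <- setNames(iris, tolower(names(iris)))

nrowspecies6 <- function(dtf, condition){
    dtf %>%
        # {{ }} forwards the caller's expression to filter()
        filter({{ condition }}) %>%
        nrow()
}

nrowspecies6(iris_lc, species == "versicolor")
# [1] 50
nrowspecies6(iris_lc, sepal.length > 6 & species == "virginica")
# [1] 41
nrowspecies6(iris_lc, sepal.length > 6 & species == "setosa")
# [1] 0
```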

edited May 05 '20 at 13:56

answered Aug 11 '16 at 13:56

This throws the error `Error: object 'species' not found` – daaronr May 04 '20 at 16:37
That's because I like to have all my column names in lower case. I updated the answer with `names(iris) = tolower(names(iris))`. Anyway, `filter_()` is deprecated, so I should probably modify the answer more extensively. – Paul Rougieux May 05 '20 at 13:54
