The advice to use base R for simple functions is good, but it does not scale to more complex tidyverse functions, and you lose portability to dplyr backends such as databases. If you want to create functions around tidyverse pipelines, you'll have to learn a bit about R expressions and the unquoting operator !!. I recommend skimming the first sections of https://tidyeval.tidyverse.org to get a rough idea of the concepts used here.
Since the function you'd like to create takes a bare column name and does not involve complex expressions (like you would pass to mutate() or summarise()), we don't need fancy stuff like quosures. We can work with symbols. To create a symbol, use as.name() or rlang::sym().
as.name("mycolumn")
#> mycolumn
rlang::sym("mycolumn")
#> mycolumn
The latter has the advantage of being part of a larger family of functions: ensym(), and the plural variants syms() and ensyms(). We are going to use ensym() to capture a column name, i.e. delay the execution of the column in order to pass it to dplyr after a few transformations. Delaying the execution is called "quoting".
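As a quick illustration of quoting, here is a minimal sketch. The helper name capture_column() is hypothetical and only exists for this example; it assumes the rlang package is installed.

```r
library(rlang)

# Hypothetical helper: capture the argument as a symbol instead of evaluating it.
capture_column <- function(var) {
  ensym(var)
}

# A bare name is captured as-is; an object called `mycolumn` does not need to exist.
sym_bare <- capture_column(mycolumn)

# ensym() also accepts a string and converts it to the same symbol.
sym_string <- capture_column("mycolumn")

print(is.symbol(sym_bare))
print(identical(sym_bare, sym_string))

# Converting the symbol back to a string, e.g. to paste a suffix onto it.
print(as_string(sym_bare))
```

Because the argument is never evaluated, the caller can pass either a bare column name or a string, which is why both calling styles work with the function below.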
I have made a few changes to the interface of your function:
Take the data frames first for consistency with dplyr functions
Don't provide defaults for the data frames. These defaults are making too many assumptions.
Make by and suffix user-configurable, with reasonable defaults.
Here is the code, with explanations inline:
mydiff <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(is.character(suffix), length(suffix) == 2)

  # Let's start with the easy task, joining the data frames
  df <- dplyr::inner_join(df1, df2, by = by, suffix = suffix)

  # Now onto dealing with the diff variable. `ensym()` takes a column
  # name and delays its execution:
  var <- rlang::ensym(var)

  # A delayed column name is not a string, it's a symbol. So we need
  # to transform it to a string in order to work with paste() etc.
  # `quo_name()` works in this case but is generally only for
  # providing default names.
  #
  # Better use base::as.character() or rlang::as_string() (the latter
  # works a bit better on Windows with foreign UTF-8 characters):
  var_string <- rlang::as_string(var)

  # Now let's add the suffix to the name:
  col1_string <- paste0(var_string, suffix[[1]])
  col2_string <- paste0(var_string, suffix[[2]])

  # dplyr::select() supports column names as strings but it is an
  # exception in the dplyr API. Generally, dplyr functions take bare
  # column names, i.e. symbols. So let's transform the strings back to
  # symbols:
  col1 <- rlang::sym(col1_string)
  col2 <- rlang::sym(col2_string)

  # The delayed column names now need to be inserted back into the
  # dplyr code. This is accomplished by unquoting with the !!
  # operator:
  df %>%
    dplyr::select(id, !!col1, !!col2) %>%
    dplyr::filter(!!col1 != !!col2)
}
mydiff(df1, df2, b)
#> # A tibble: 1 x 3
#>      id b.x   b.y
#>   <dbl> <chr> <chr>
#> 1    18 bar   foo

mydiff(df1, df2, "a")
#> # A tibble: 1 x 3
#>      id a.x   a.y
#>   <dbl> <chr> <chr>
#> 1    14 f     k
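To see what unquoting actually produces, one option is to build the dplyr calls with rlang::expr() and print them instead of running them. This is only an inspection sketch: `df` is a placeholder name that is never evaluated, and the column names are the ones from the outputs above.

```r
library(rlang)

col1 <- sym("b.x")
col2 <- sym("b.y")

# expr() builds the call without running it, so `df` does not need to exist.
# The !! operator replaces col1 and col2 with the symbols they contain.
filter_call <- expr(dplyr::filter(df, !!col1 != !!col2))
print(filter_call)

# !!! splices every element of a list of symbols into the call as separate arguments.
cols <- syms(c("a.x", "a.y"))
select_call <- expr(dplyr::select(df, id, !!!cols))
print(select_call)
```

The printed calls are what dplyr effectively receives: ordinary code with bare column names, as if they had been typed by hand.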
You can also simplify the function by taking strings instead of bare column names. In this version, I'll use syms() to create a list of symbols, and !!! to pass it all at once to select():
mydiff2 <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(
    is.character(suffix), length(suffix) == 2,
    is.character(var), length(var) == 1
  )

  # Create a list of symbols from a character vector:
  cols <- rlang::syms(paste0(var, suffix))

  df <- dplyr::inner_join(df1, df2, by = by, suffix = suffix)

  # Unquote the whole list at once with the big bang !!!
  df %>%
    dplyr::select(id, !!!cols) %>%
    dplyr::filter(!!cols[[1]] != !!cols[[2]])
}
mydiff2(df1, df2, "a")
#> # A tibble: 1 x 3
#>      id a.x   a.y
#>   <dbl> <chr> <chr>
#> 1    14 f     k
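For readers who want to try the pattern without the data frames from the question, here is a small end-to-end sketch. The data below is made up for illustration and is not the original df1 and df2; the pipeline repeats the same steps as mydiff2() inline so that the block runs on its own, assuming dplyr and rlang are installed.

```r
library(dplyr)
library(rlang)

# Hypothetical example data: the two tables disagree on column `a` only for id 2.
df1 <- tibble(id = c(1, 2, 3), a = c("x", "y", "z"))
df2 <- tibble(id = c(1, 2, 3), a = c("x", "q", "z"))

suffix <- c(".x", ".y")

# Build the suffixed column names and turn them into symbols.
cols <- syms(paste0("a", suffix))

# Join, keep the key plus both versions of the column, then keep differing rows.
result <- inner_join(df1, df2, by = "id", suffix = suffix) %>%
  select(id, !!!cols) %>%
  filter(!!cols[[1]] != !!cols[[2]])

print(result)
```

Only the row for id 2 remains, because it is the only one where a.x and a.y differ in this made-up data.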