7

I've looked at other answers but cannot find a solution for the code below to work. Basically, I'm creating a function that inner_join the two data frame and filter based on a column inputted in the function.

The problem is that the filter part of the function doesn't work. However it works if I take filter off the function and append it like mydiff("a") %>% filter(a.x != a.y)

Any suggestion is helpful.

Note that I am function input in quotes

library(dplyr)

# fake data
df1<- tibble(id = seq(4,19,2), 
             a = c("a","b","c","d","e","f","g","h"), 
             b = c(rep("foo",3), rep("bar",5)))
df2<- tibble(id = seq(10, 20, 1), 
             a = c("d","a", "e","f","k","m","g","i","h", "a", "b"),
             b = c(rep("bar", 7), rep("foo",4)))

# What I am trying to do
dplyr::inner_join(df1, df2, by = "id") %>% select(id, b.x, b.y) %>% filter(b.x!=b.y)

#> # A tibble: 1 x 3
#>      id b.x   b.y  
#>   <dbl> <chr> <chr>
#> 1    18 bar   foo

# creating a function so that I can filter by difference in column if I have more columns
mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 = paste0(quo_name(filteron), "x")
  col_2 = paste0(quo_name(filteron), "y")
  my_df<- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))
  my_df %>% select(id, col_1, col_2) %>% filter(col_1 != col_2)
}

# the filter part is not working as expected. 
# There is no difference whether i pipe filter or leave it out
mydiff("a")

#> # A tibble: 5 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    10 d     d    
#> 2    12 e     e    
#> 3    14 f     k    
#> 4    16 g     g    
#> 5    18 h     h
Tung
  • 26,371
  • 7
  • 91
  • 115
x85ms16
  • 587
  • 7
  • 17

5 Answers5

7

The reason it did not work in your original function was that col_1 was string but dplyr::filter() expected "unquoted" input variable for the LHS. Thus, you need to first convert col_1 to variable using sym() then unquote it inside filter using !! (bang bang).

rlang has really nice function qq_show to show what actually happens with quoting/unquoting (see the output below)

See also this similar question

library(rlang)
library(dplyr)

# creating a function that can take either string or symbol as input
mydiff <- function(filteron, df_1 = df1, df_2 = df2) {

  col_1 <- paste0(quo_name(enquo(filteron)), "x")
  col_2 <- paste0(quo_name(enquo(filteron)), "y")

  my_df <- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))

  cat('\nwithout sym and unquote\n')
  qq_show(col_1 != col_2)

  cat('\nwith sym and unquote\n')
  qq_show(!!sym(col_1) != !!sym(col_2))
  cat('\n')

  my_df %>% 
    select(id, col_1, col_2) %>% 
    filter(!!sym(col_1) != !!sym(col_2))
}

### testing: filteron as a string
mydiff("a")
#> 
#> without sym and unquote
#> col_1 != col_2
#> 
#> with sym and unquote
#> ax != ay
#> 
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k

### testing: filteron as a symbol
mydiff(a)
#> 
#> without sym and unquote
#> col_1 != col_2
#> 
#> with sym and unquote
#> ax != ay
#>  
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k

Created on 2018-09-28 by the reprex package (v0.2.1.9000)

Tung
  • 26,371
  • 7
  • 91
  • 115
  • 1
    This is exactly what I was looking for. I did not know that I'd have to first `sym` then `!!` for it to work. Good to know that I was almost there! – x85ms16 Oct 01 '18 at 21:46
  • @x85ms16: if you use `quo_name` together with `enquo`, you can make the function more flexible as it now works with both string and symbol as input – Tung Oct 01 '18 at 22:45
5

From https://dplyr.tidyverse.org/articles/programming.html

Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that means they don't follow the usual R rules of evaluation.

This can sometimes create a few issues when attempting to wrap them in functions. Here is a base version of the function you created.

mydiff<- function(filteron, df_1=df1, df_2 = df2){

                 col_1 = paste0(filteron,"x")
                 col_2 = paste0(filteron, "y")

                 my_df <- merge(df1, df2, by="id", suffixes = c("x","y"))

                 my_df[my_df[, col_1] != my_df[, col_2], c("id", col_1, col_2)]  
         }

> mydiff("a")
  id ax ay
3 14  f  k
> mydiff("b")
  id  bx  by
5 18 bar foo

This will solve your problem and will likely work as one expects, now and in the future. With less dependencies on outside packages, you reduce these kind of issues and other quirks which may develop in the future as the package authors evolve their work.

Justin
  • 1,360
  • 12
  • 15
  • I think the advice about abandoning dplyr needs to be balanced with the downside of doing so, mainly you lose the portability of code to different sources of data. – Lionel Henry Sep 29 '18 at 05:22
  • 3
    Interesting point of view. But perhaps abandoning dplyr expands the portability of ones code as writing functions without it becomes simpler, more predictable, and more consistent. As functions are the building blocks of packages and packages remain the gold standard for shipping ones R code to others, it stands to reason that base code is more portable and reaches a wider range of data sources than dplyr. – Justin Sep 29 '18 at 06:03
  • 1
    @lionel How does not using dplyr effect portability of code ? – meh Oct 01 '18 at 16:56
  • I was talking about dplyr backends. – Lionel Henry Oct 01 '18 at 17:07
  • @Justin This is a great alternative of using just base R to my problem. Thank you – x85ms16 Oct 01 '18 at 21:50
1

Seems an evaluation issue to me. Try this modified mydiff function, using the lazyeval package:

mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 <- paste0(quo_name(filteron), "x")
  col_2 <- paste0(quo_name(filteron), "y")
  criteria <- lazyeval::interp(~ x != y, .values = list(x = as.name(col_1), y = as.name(col_2)))
  my_df <- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))
  my_df %>% select(id, col_1, col_2) %>% filter_(criteria)
}

You can take a look at the Functions chapter from Hadley Wickham’s book Advanced R for more on that.

Augusto Fadel
  • 199
  • 1
  • 6
1

The advice of using base R for simple functions is good, however it does not scale to more complex tidyverse functions and you lose the portability to dplyr backends like databases. If you want to create functions around tidyverse pipelines, you'll have to learn a bit about R expressions and the unquoting operator !!. I recommend skimming over the first sections of https://tidyeval.tidyverse.org to get a rough idea of the concepts used here.

Since the function you'd like to create takes a bare column name and does not involve complex expressions (like you would pass to mutate() or summarise()), we don't need fancy stuff like quosures. We can work with symbols. To create a symbol, use as.name() or rlang::sym().

as.name("mycolumn")
#> mycolumn

rlang::sym("mycolumn")
#> mycolumn

The latter has the advantage of being part of a larger family of functions: ensym(), and the plural variants syms() and ensyms(). We are going to use ensym() to capture a column name, i.e. delay the execution of the column in order to pass it to dplyr after a few transformations. Delaying the execution is called "quoting".

I have made a few changes to the interface of your function:

  • Take the data frames first for consistency with dplyr functions

  • Don't provide defaults for the data frames. These defaults are making too many assumptions.

  • Make by and suffix user-configurable, with reasonable defaults.

Here is the code, with explanations inline:

mydiff <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(is.character(suffix), length(suffix) == 2)

  # Let's start by the easy task, joining the data frames
  df <- dplyr::inner_join(df1, df2, by = by, suffix = suffix)

  # Now onto dealing with the diff variable. `ensym()` takes a column
  # name and delays its execution:
  var <- rlang::ensym(var)

  # A delayed column name is not a string, it's a symbol. So we need
  # to transform it to a string in order to work with paste() etc.
  # `quo_name()` works in this case but is generally only for
  # providing default names.
  #
  # Better use base::as.character() or rlang::as_string() (the latter
  # works a bit better on Windows with foreign UTF-8 characters):
  var_string <- rlang::as_string(var)

  # Now let's add the suffix to the name:
  col1_string <- paste0(var_string, suffix[[1]])
  col2_string <- paste0(var_string, suffix[[2]])

  # dplyr::select() supports column names as strings but it is an
  # exception in the dplyr API. Generally, dplyr functions take bare
  # column names, i.e. symbols. So let's transform the strings back to
  # symbols:
  col1 <- rlang::sym(col1_string)
  col2 <- rlang::sym(col2_string)

  # The delayed column names now need to be inserted back into the
  # dplyr code. This is accomplished by unquoting with the !!
  # operator:
  df %>%
    dplyr::select(id, !!col1, !!col2) %>%
    dplyr::filter(!!col1 != !!col2)
}

mydiff(df1, df2, b)
#> # A tibble: 1 x 3
#>      id b.x   b.y
#>   <dbl> <chr> <chr>
#> 1    18 bar   foo

mydiff(df1, df2, "a")
#> # A tibble: 1 x 3
#>      id a.x   a.y
#>   <dbl> <chr> <chr>
#> 1    14 f     k

You can also simplify the function by taking strings instead of bare column names. In this version, I'll use syms() to create a list of symbols, and !!! to pass it all at once to select():

mydiff2 <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(
    is.character(suffix), length(suffix) == 2,
    is.character(var), length(var) == 1
  )

  # Create a list of symbols from a character vector:
  cols <- rlang::syms(paste0(var, suffix))

  df <- dplyr::inner_join(df1, df2, by = by, suffix = suffix)

  # Unquote the whole list as once with the big bang !!!
  df %>%
    dplyr::select(id, !!!cols) %>%
    dplyr::filter(!!cols[[1]] != !!cols[[2]])
}

mydiff2(df1, df2, "a")
#> # A tibble: 1 x 3
#>      id a.x   a.y
#>   <dbl> <chr> <chr>
#> 1    14 f     k
Lionel Henry
  • 6,652
  • 27
  • 33
  • Great answer! Don't know why you were downvoted. Can we use `quo_name(enquo(var))` to make the function `mydiff2` more flexible i.e. can accept input either as a string or a symbol? – Tung Sep 29 '18 at 18:53
  • In this case it would not work because the inputs have to follow the select syntax, since they are forwarded to `select()`. At the same time they can't use select helpers because they are forwarded to `filter()` as well. So the common denominator is just bare column names. That's a good point, I'll add it to the bookdown. – Lionel Henry Sep 30 '18 at 07:03
  • In that case I'd prefer `mydiff()` to `mydiff2()` cause it's more flexible :) – Tung Sep 30 '18 at 15:31
  • @lionel this is super helpful. Thank you for explaining things in detail. Now I can refer back to this when I am writing more complex dplyr functions. Your tips on parameters and quoting are very helpful. One question, Do you need to unquote using `!!` or `!!!` every time you use the column from the parameters? – x85ms16 Oct 01 '18 at 21:49
  • Yes, the main pattern for creating functions around dplyr pipelines is "quote-and-unquote". You quote with `enquo()` or `ensym()` (or their plural variants), and you unquote with `!!` and `!!!`. – Lionel Henry Oct 02 '18 at 07:23
1

Finding index for col_1 != col_2 first might be enough for this problem.

mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 <- paste0(quo_name(filteron), "x")
  col_2 <- paste0(quo_name(filteron), "y")
  my_df <-
    inner_join(df_1, df_2, by = "id", suffix = c("x", "y")) %>%
    select(id, col_1, col_2)
  # find indices of different columns
  same <- my_df[, col_1] != my_df[, col_2]
  # return for the rows
  my_df[same, ]
}
my_diff("a")
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k
younggeun
  • 923
  • 1
  • 12
  • 19