
In this answer I used %>%, the magrittr pipe, the way its documentation shows it: with a function name and no empty parentheses on the RHS. I then got a comment saying that the recommended convention is to supply empty parentheses on the RHS.

library(magrittr)

1:3 %>% sum    # The documentation calls this: Basic use

1:3 %>% sum()  # It's also possible to supply empty parentheses
1:3 |> sum()   # And it's similar to |>, the base pipe

An advantage of the empty parentheses might be that the syntax then matches that of |>, the base pipe.

But on the other hand, %>% could also be used like a function and there functions are typically provided without parentheses.

`%>%`(1:3, sum)

sapply(list(1:3), sum)
`%=>%` <- sapply
list(1:3) %=>% sum

do.call(sum, list(1:3))
`%<%` <- do.call
sum %<% list(1:3)

In this case, it looks like it's consistent to use it without parentheses.

On the other hand, when using the placeholder, parentheses need to be provided.

"axc" %>% sub("x", "b", .)

What are the disadvantages when providing a function without parentheses to the pipe and what are the good technical reasons to provide it with empty parentheses?

Peter Mortensen
GKi
  • The comment you seem to be referring to also points out that the package author doesn't agree with the "recommendation". It's really just a matter of style. Just like there's no real difference between `<-` and `->` but people "recommend" the former. The function was written to accept a symbol that points to a function or a function call. – MrFlick May 25 '23 at 19:29
  • But is there a *reason* for this recommendation? – GKi May 25 '23 at 19:30
  • I _think_ this is a matter of taste, transparency and consistency rather than there being a good performance-based reason. One can _always_ chain calls that have parentheses, but it's not always possible to write a pipe _without_. When you skim someone's code, it's good to be able to see that they are chaining _calls_ to functions. I often don't bother with parentheses in interactive sessions, but if using in a code example or package, I use parentheses for consistency and clarity. – Allan Cameron May 25 '23 at 19:33
  • Is there any other citation for this recommendation in the wild other than that comment? In general style guides are opinionated just for consistency. There aren't meaningful technical differences between using two spaces, four spaces, or tabs when indenting code (unless you are super concerned with ascii character count). It's nothing more than a matter of opinion. – MrFlick May 25 '23 at 19:34
  • As a style guide I would use the manual and there (`?"%>%"`) I find no example with empty parentheses. – GKi May 25 '23 at 19:40
  • I don't hate the question, but I think it's mostly _preference_. The two pipes (`%>%` and `|>`) are not perfectly identical: `1:3 %>% sum` works whereas `1:3 |> sum` does not, but I don't think that's a "good technical reason" for the use of parens. I believe that, since `|>` works at the parsing level and `%>%` works at the function level, one good technical reason is to use `|>`, because it is ever-so-slightly faster and it requires parens. Bottom line, I don't think a _good technical reason_ exists. Interesting conversation here. – r2evans May 25 '23 at 19:53

1 Answer


> But on the other hand %>% could also be used like a function and there functions are typically provided without parentheses.

No, this is confusing things: there is no single way in which functions are “typically provided”, it entirely depends on the usage.

You use the examples of sapply and do.call. Both are higher-order functions, which means that they expect functions as arguments.[1] Since they expect functions as arguments, we can pass a name which refers to a function. But instead of a name we can also pass an arbitrary expression which evaluates to a function.

… In fact, don’t get hung up on the fact that you are passing a name in your example, it’s a red herring. Here’s an example where we pass the result of an expression (which returns a function) instead:

make_adder = function (y) {
    function (x) x + y
}

sapply(1 : 3, make_adder(2))

But this is potentially a distraction, because %>% does not expect a function object as its second argument. Instead, it expects a function call expression.

In my example above, sapply is a regular function, which evaluates its arguments using standard evaluation. Both its arguments, 1 : 3, as well as make_adder(2), are evaluated and the results are passed to sapply as arguments.[2]
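A small base R check may make this concrete. Because sapply only ever receives the evaluated function, it should not matter whether that function is passed by name or as the expression that produces it. The name add_2 below is introduced purely for illustration:

```r
make_adder <- function(y) {
    function(x) x + y
}

# `add_2` is an illustrative name for the function returned by make_adder(2)
add_2 <- make_adder(2)

by_name <- sapply(1:3, add_2)                # pass a name that refers to a function
by_expression <- sapply(1:3, make_adder(2))  # pass an expression that evaluates to one

# Under standard evaluation both spellings give the same result
same_result <- identical(by_name, by_expression)
```

Here same_result is TRUE: sapply cannot tell the two spellings apart, which is exactly what one expects from a regular function.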

%>% is not a regular function: it suppresses standard evaluation of the second argument. Instead, it keeps the expression in its unevaluated form and manipulates it. The way it does that is fairly complex but in the simplest case it injects its first argument into the expression and subsequently evaluates it. Here’s some pseudocode to illustrate this:

`%>%` = function (lhs, rhs) {
    # Get the unevaluated expression passed as `rhs`
    rhs_expr = substitute(rhs)
    new_rhs_expr = insert_first_argument_into(rhs_expr, lhs)
    eval.parent(new_rhs_expr)
}

This works for any valid rhs expression: sum(), head(3), etc. %>% transforms these into, respectively, sum(lhs), head(lhs, 3), etc., and evaluates the resulting expression.
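The pseudocode leaves insert_first_argument_into undefined. A minimal runnable sketch in base R, assuming that only call expressions are supported, could look as follows. The operator name %pipe% and the variable caller are made up for this illustration, and this is a simplification rather than magrittr's actual implementation:

```r
# Hypothetical operator `%pipe%`: a simplified stand-in, not magrittr's real code.
`%pipe%` <- function(lhs, rhs) {
    # Get the unevaluated expression passed as `rhs`, e.g. `head(3)`
    rhs_expr <- substitute(rhs)
    stopifnot(is.call(rhs_expr))  # this sketch handles call expressions only

    # Rebuild the call with a reference to `lhs` as the first argument:
    # `f(a, b)` becomes `f(lhs, a, b)`
    new_rhs_expr <- as.call(c(
        list(rhs_expr[[1L]], quote(lhs)),
        as.list(rhs_expr)[-1L]
    ))

    # Evaluate where `lhs` is visible and everything else resolves in the caller
    caller <- parent.frame()
    eval(new_rhs_expr, list(lhs = lhs), caller)
}

total <- 1:3 %pipe% sum()          # evaluates sum(lhs), giving 6
first <- letters %pipe% head(3)    # evaluates head(lhs, 3), giving "a" "b" "c"
```

One limitation of this sketch: a variable that happens to be called lhs in the calling code would be shadowed inside the piped call.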

So far, this is perfectly consistent. However, the author of %>% chose to allow an additional, entirely distinct usage: instead of passing a function call expression as rhs, you can also pass a simple name. In that case, %>% does something completely different. Instead of constructing a new call expression that injects lhs, and evaluating that, it directly calls rhs(lhs):

`%>%` = function (lhs, rhs) {
    rhs_expr = substitute(rhs)

    if (is.name(rhs_expr)) {
        rhs(lhs)
    } else {
        # (code from above.)
    }
}

In other words, %>% accepts two fundamentally different types of arguments as rhs, and does different things for them.
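How a function can tell these two kinds of argument apart may be easier to see in isolation. The helper rhs_kind below is hypothetical and only inspects the captured code; it never evaluates its argument, which is why make_adder does not even need to be defined for this to run:

```r
# Hypothetical helper: reports which of the two RHS forms was written.
rhs_kind <- function(rhs) {
    rhs_expr <- substitute(rhs)  # capture the code, do not evaluate it
    if (is.name(rhs_expr)) {
        "name"
    } else if (is.call(rhs_expr)) {
        "call"
    } else {
        "other"
    }
}

rhs_kind(sum)            # "name"
rhs_kind(sum())          # "call"
rhs_kind(make_adder(2))  # "call", even though it would evaluate to a function
```

The distinction is purely syntactic: it depends on how the argument is spelled, not on what it evaluates to.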

This isn’t in itself a problem yet. It becomes a problem if we pass a function factory as the rhs. That’s a higher-order function which itself returns a function. make_adder from above is such a function factory.

So: what does 1 : 3 %>% make_adder(2) do? …

Error in make_adder(., 2) : unused argument (2)

Oh, right! make_adder(2) is a function call expression, so the first definition of %>% applies: transform the expression and evaluate it. So it attempts to evaluate make_adder(1 : 3, 2), and that fails, because make_adder only expects one argument.

Luckily for our sanity we can use make_adder with %>%. This doesn’t even require additional rules or documentation. With a bit of thinking it follows directly from the first definition above: we need to add another layer of function call, because we want %>% to call the function that is returned by make_adder. The following works:

1 : 3 %>% make_adder(2)()
# 3 4 5

%>% interpolated the lhs such that new_rhs_expr became make_adder(2)(1 : 3).

We could make this a bit more readable by assigning the return value of make_adder(2) to a name:

add_2 = make_adder(2)

1 : 3 %>% make_adder(2)()      # (1)
#         \___________/
#               v
#             /‾‾‾\
1 : 3 %>%     add_2()          # (2)

We directly replaced a subexpression by a newly introduced name here. This is an extremely basic computer science concept, but it is so powerful that it has its own name: referential transparency. It’s a concept which makes reasoning about programs easier, because we know that we can always assign an arbitrary subexpression to a name and use that name in its place in a piece of code: (1) and (2) are identical.

But, actually, referential transparency requires that we can also do the replacement in reverse, i.e. replace the name by the value that it refers to. Sure enough, this works, and we get our original expression back:

1 : 3 %>%     add_2()          # (1)
#             \___/
#               v
#         /‾‾‾‾‾‾‾‾‾‾‾\
1 : 3 %>% make_adder(2)()      # (2)

(1) and (2) are still identical.

But unfortunately it does not always work:

1 : 3 %>%     add_2            # (1)
#             \___/
#               v
#         /‾‾‾‾‾‾‾‾‾‾‾\
1 : 3 %>% make_adder(2)        # (2)

(1) works, but (2) fails, even though we merely substituted add_2 with its definition. %>% does not preserve referential transparency.[3]
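The same behaviour can be reproduced without magrittr. The sketch below defines a hypothetical two-mode operator %pipe% that mirrors the pseudocode from earlier in this answer (it is an illustration, not magrittr's real code) and then tries all three spellings:

```r
make_adder <- function(y) {
    function(x) x + y
}
add_2 <- make_adder(2)

# Hypothetical two-mode pipe mirroring the pseudocode; not magrittr's implementation.
`%pipe%` <- function(lhs, rhs) {
    rhs_expr <- substitute(rhs)
    if (is.name(rhs_expr)) {
        # Bare name: call the function it refers to directly
        rhs(lhs)
    } else {
        # Call expression: insert `lhs` as the first argument, then evaluate
        new_rhs_expr <- as.call(c(
            list(rhs_expr[[1L]], quote(lhs)),
            as.list(rhs_expr)[-1L]
        ))
        caller <- parent.frame()
        eval(new_rhs_expr, list(lhs = lhs), caller)
    }
}

with_name <- 1:3 %pipe% add_2                   # works
with_definition <- tryCatch(
    1:3 %pipe% make_adder(2),                   # becomes make_adder(lhs, 2) and fails
    error = function(e) "error"
)
with_parentheses <- 1:3 %pipe% make_adder(2)()  # becomes make_adder(2)(lhs), works
```

Only the spellings with_name and with_parentheses produce 3 4 5; replacing the bare name by its definition ends up in the error branch.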

And that is why not using parentheses on the RHS is inconsistent, and why it is widely discouraged (e.g. by the tidyverse style guide). And it is also (as far as I understand) why the R core developers decided that |> always requires a function call expression as its RHS, and you cannot omit the parentheses.
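That restriction of |> can be checked without triggering a syntax error in one's own script by parsing the code from a string. This is a small sketch assuming R 4.1 or later (the first version with the base pipe); the helper name parses is made up for the illustration:

```r
# Hypothetical helper: TRUE if `code` is syntactically valid R, FALSE otherwise.
parses <- function(code) {
    tryCatch({
        parse(text = code)
        TRUE
    }, error = function(e) FALSE)
}

bare_name_ok <- parses("1:3 |> sum")     # FALSE: the parser rejects a bare name as RHS
with_call_ok <- parses("1:3 |> sum()")   # TRUE on R >= 4.1
```

The rejection happens while the code is parsed, before anything is evaluated, so the ambiguity described above cannot arise with |>.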


[1] We have a special word for this concept because accepting functions as arguments used to be very uncommon in mainstream programming languages.

[2] This is a simplification. The truth is more complicated, but irrelevant here. If you are curious, see R Language Definition: Argument evaluation.

[3] Violating referential transparency in R is quite easy because R gives us a lot of control over how we want to evaluate expressions. And often this can be quite handy. But when not used with care it can cause confusing code and subtle bugs, and it is recommended to weigh violations of referential transparency carefully against the benefits.

Konrad Rudolph
  • To add a historical anecdote to this already too long answer, ‘magrittr’ isn’t the first implementation of a pipe operator in R, and [the first implementation that I am aware of](https://stat.ethz.ch/pipermail/r-help/2011-March/273361.html) has a similar inconsistency. – Konrad Rudolph May 25 '23 at 21:52