1

Let's say I'd like to write anscombe %>% lm_tidy(x1, y1), where x1 and y1 are columns of the data frame (rather than the string version, anscombe %>% lm_tidy("x1", "y1")). Since the following function seems to work:

plot_gg <- function(df, x, y) {
  x <- enquo(x)
  y <- enquo(y)
  ggplot(df, aes(x = !!x, y = !!y)) + geom_point() +
    geom_smooth(formula = y ~ x, method="lm", se = FALSE)
}

I started writing the following function:

lm_tidy_1 <- function(df, x, y) {
  x <- enquo(x)
  y <- enquo(y)
  fm <- y ~ x            ##### I tried many stuff here!
  lm(fm, data=df)
}
## Error in model.frame.default(formula = fm, data = df, drop.unused.levels = TRUE) : 
##   object is not a matrix

One comment in "passing in column name as argument" states that the embrace operator {{ }} is shorthand for the quote-unquote pattern (!!enquo()). Unfortunately, the two forms gave different error messages:

lm_tidy_2 <- function(df, x, y) {
  fm <- !!enquo(y) ~ !!enquo(x) # alternative: {{y}} ~ {{x}} with different errors!!
  lm(fm, data=df)
}
## Error:
## ! Quosures can only be unquoted within a quasiquotation context.

This seems to work when x and y are passed as strings (based on @jubas's answer, but we're stuck with string handling and paste):

lm_tidy_str <- function(df, x, y) {
  fm <- formula(paste({{y}}, "~", {{x}}))
  lm(fm, data=df)
}

Yet again, {{y}} != !!enquo(y). But it's worse: the following function breaks down with the same Quosure error as earlier:

lm_tidy_str_1 <- function(df, x, y) {
  x <- enquo(x)
  y <- enquo(y)
  fm <- formula(paste(!!y, "~", !!x))
  lm(fm, data=df)
}
  1. Is {{y}} != !!enquo(y)?
  2. How to pass data-variables to lm?

EDIT: Sorry, there were left-overs from my many trials. I want to directly pass the data-variables (say x1 and y1) to the function that is going to use them as formula components (such as lm) and not their string versions ("x1" and "y1"): I try to avoid strings as long as possible and it's more streamlined from the user perspective.

green diod
  • First are you passing quoted variables or unquoted ones? ie strings vs symbols? Also if you are going to write a function like this, why not just use `lm.fit?` – Onyambu May 25 '22 at 22:29
  • Give an example of how you would like to use this, and why you need this – Onyambu May 25 '22 at 22:30
  • do you know the package `rlang` - it has functions for metaprogramming. - And first of all - please show us the code you want to abstract over - which code - and which parts of that code you want to be abstracted? – Gwang-Jin Kim May 25 '22 at 22:36
  • you can use `x <- if (is.character(substitute(x))) x else deparse(substitute(x))` to convert quoted or unquoted variables to strings. then `lm(reformulate(x, y), data = data)` no need to add a dependency for one line of code – rawr May 26 '22 at 00:26
  • @rawr no need for `if else`. Just `as.character(substitute(x))` will do. Also check the answer provided – Onyambu May 26 '22 at 02:02
  • @onyambu i know.. it doesn't work for all cases, that's why i commented with mine that does – rawr May 26 '22 at 02:41
  • @rawr what cases does it not work with? – Onyambu May 26 '22 at 02:53
  • @onyambu oh I think you meant `as.character(substitute(x))` in your answer but all you have there now is `substitute(x)`? thats the only thing I can think of. but yes the current answer does not work for mixing types – rawr May 26 '22 at 02:55
  • @rawr which mixing type? I still do not understand. Look at the examples used for the solution – Onyambu May 26 '22 at 02:58
  • @onyambu https://imgur.com/a/mvtBJMB – rawr May 26 '22 at 02:59
  • actually it only seems to fail if `x` is a symbol/language object – rawr May 26 '22 at 03:00
  • @rawr does the edit solve the issue? I am on my phone – Onyambu May 26 '22 at 03:03
  • @onyambu yes!.. – rawr May 26 '22 at 03:07
  • @onyambu I try to pass data-variables and not strings representing their names (I edited my first sentence as the string version was not what I intended to ask) – green diod May 26 '22 at 12:20
  • Have you tried the answer i gave? – Onyambu May 26 '22 at 12:22
  • If none of the three solutions provided answers your question, please consider expounding on what exactly you want. Seems there is something we are missing that need to be incorporated – Onyambu May 26 '22 at 13:02
  • I'd like a solution using the metaprogramming facilities of dplyr/rlang. I don't see why formulas would prevent a solution when it was easy with a `ggplot` call. – green diod May 26 '22 at 13:54
  • So in short you need a dplyr solution? Note that `ggplot` is a tidyverse package/function hence works with the `rlang` syntax, but `lm` is a base R function that does not do the same. For example `map(x, ~.x)` will work but `lapply(x, ~.x)` will not work the base R are different from the tidyverse – Onyambu May 26 '22 at 14:06
  • Unfortunately, there's no `lm` function in the tidyverse. And sometimes one needs outputs from base functions to feed functions from the tidyverse. – green diod May 26 '22 at 14:08
  • Actually, both solutions work fine. Yours has the advantage to take care of both string versions and plain data-variables. The downside is the loss of the actual formula used in the called function. My search for a dplyr solution is for consistency and to understand how one can cope with formulas in that case. – green diod May 26 '22 at 14:15

4 Answers

4

Consider:

lm_tidy_1 <- function(df, x, y) {
  fm <- reformulate(as.character(substitute(x)), substitute(y))
  lm(fm, data=df)
}

lm_tidy_1(iris, Species, Sepal.Length)
lm_tidy_1(iris, 'Species', Sepal.Length)
lm_tidy_1(iris, Species, 'Sepal.Length')
lm_tidy_1(iris, 'Species', 'Sepal.Length')
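
To see why all four calls above end up with the same model, it can help to run the two building blocks on their own. The following is only a small sketch in base R; to_name() is a hypothetical helper introduced for illustration and is not part of the function above:

```r
# Hypothetical helper: capture the argument unevaluated and turn it into a string.
to_name <- function(x) as.character(substitute(x))

to_name(Species)     # a bare symbol ...
to_name("Species")   # ... and a string both give "Species"

# reformulate() takes term labels (strings) and a response,
# and returns a formula object, not a string.
fm <- reformulate("Species", response = "Sepal.Length")
fm
class(fm)

fit <- lm(fm, data = iris)
names(coef(fit))
```

So the strings are only an intermediate step: what reaches lm() is a regular formula.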

Edit:

If you need the formula to appear, change the call object:

lm_tidy_1 <- function(df, x, y) { 
   fm <- reformulate(as.character(substitute(x)), substitute(y)) 
   res<-lm(fm, data=df) 
   res$call[[2]]<- fm
   res
}

lm_tidy_1(iris, Species, Sepal.Length) 

Call:
lm(formula = Sepal.Length ~ Species, data = df)

Coefficients:
      (Intercept)  Speciesversicolor   Speciesvirginica  
            5.006              0.930              1.582  
Onyambu
  • Your solution works. A minor point is we lose the actual formula (only fm is displayed). Why the `substitute` and so we're stuck with converting back to strings with `reformulate`? I was looking for another way with metaprogramming. – green diod May 26 '22 at 12:33
  • @greendiod note that any formula passed into `lm` as a varible will be lost , unless you change the call of the object. Second, there is no changing to strings anywhere. `reformulate` makes a formula and not a string. – Onyambu May 26 '22 at 12:45
  • I know that `reformulate` returns a formula. But it does take strings as input. – green diod May 26 '22 at 13:06
  • @greendiod is there a problem with that? Do you want it to throw an error when strings are passed? – Onyambu May 26 '22 at 13:09
  • @onyambu That's good with the call[[2]]! At the end, R's syntax looks messy - whatever one uses - yours or my method ;) . – Gwang-Jin Kim May 26 '22 at 13:22
4

@BrianSyzdek's answer is pretty good. However, it has 3 downsides:

  1. One cannot see the formula or the data that were actually used:

Call:
lm(formula = fm, data = .)

  2. One has to input the symbols as strings.
  3. The dependency on rlang, though it is a great package.

You can indeed solve this problem with pure base R!

The solution in pure base R

Under the hood, R is actually a Lisp, so it is well suited to such metaprogramming tasks. The only downside of R is its horrible syntax: especially for metaprogramming, it is not as elegant as the Lisp languages. The syntax can be really confusing, as you experienced yourself when trying to solve this problem.

The solution is to use substitute(), which lets you replace pieces of code inside an expression without evaluating them:

lm_tidy <- function(df, x, y) {
  # capture the arguments as code pieces instead of evaluating them
  .x <- substitute(x)
  .y <- substitute(y)
  .df <- substitute(df)
  # take the code piece `y ~ x` and fill it in using the list as a lookup table
  .fm <- substitute(y ~ x, list(y=.y, x=.x))
  # take the code `lm(fm, data=df)`, replace `fm` and `df` with the code pieces
  # stored in `.fm` and `.df`, and finally evaluate the substituted code in the
  # parent environment (the environment where the function was called)
  eval.parent(substitute(lm(fm, data=df), list(fm=.fm, df=.df)))
}

The trick is to use eval.parent(substitute( <your expression>, <a list which determines the evaluation lookup-table for the variables in your expression>)).
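
To see what the two substitute() steps produce before anything is evaluated, one can build the call by hand. This is only an illustrative sketch; the symbols are written with quote() here and stand in for what substitute(x), substitute(y) and substitute(df) would capture inside the function:

```r
# Stand-ins for the captured function arguments (assumed values for illustration)
.x  <- quote(cyl)
.y  <- quote(mpg)
.df <- quote(mtcars)

# Step 1: fill the symbols into the formula template
.fm <- substitute(y ~ x, list(y = .y, x = .x))

# Step 2: fill the formula and the data symbol into the lm() call
.call <- substitute(lm(fm, data = df), list(fm = .fm, df = .df))

print(.call)        # lm(mpg ~ cyl, data = mtcars)
fit <- eval(.call)  # nothing is evaluated until this point
```

Because the finished call contains the real column and data names, the fitted model reports them in its Call: line.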

Beware of scoping! As long as <your expression> is built only from variables that are defined inside the function or in the lookup list given to substitute(), there won't be any scoping problems. Avoid referring to any other variables within <your expression>; this is the only rule you have to obey to use eval()/eval.parent() safely in this context. Even then, eval.parent() makes sure that the substituted code is executed in the environment from which the function was called.

Now, you can do:

lm_tidy(mtcars, cyl, mpg)

The output is now as desired:

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  

And we did this with pure base R!

The trick for safe use of eval() is really that every variable in the substitute() expression is defined/given inside the lookup tables for substitute() or the function's argument. In other words: None of the replaced variables refers to any dangling variables outside the function definition.

plot_gg function

So following these rules, your plot_gg function would be defined as:

plot_gg <- function(df, x, y) {
  .x <- substitute(x)
  .y <- substitute(y)
  .df <- substitute(df)
  # geom_smooth() fits on the aesthetics, so its formula must stay the literal
  # `y ~ x`; the placeholders are named XVAR/YVAR/DF so that it is not replaced
  eval.parent(substitute(
    ggplot(DF, aes(x = XVAR, y = YVAR)) + geom_point() +
      geom_smooth(formula = y ~ x, method = "lm", se = FALSE),
    list(XVAR = .x, YVAR = .y, DF = .df)
  ))
}

When you want to enter x and y as strings


lm_tidy_str <- function(df, x, y) {
  .x <- as.name(x)
  .y <- as.name(y)
  .df <- substitute(df)
  .fm <- substitute(y ~ x, list(y=.y, x=.x))
  eval.parent(substitute(lm(fm, data=df), list(fm=.fm, df=.df)))
}

plot_gg_str <- function(df, x, y) {
  .x <- as.name(x)
  .y <- as.name(y)
  .df <- substitute(df)
  # geom_smooth() fits on the aesthetics, so its formula must stay the literal
  # `y ~ x`; the placeholders are named XVAR/YVAR/DF so that it is not replaced
  eval.parent(substitute(
    ggplot(DF, aes(x = XVAR, y = YVAR)) + geom_point() +
      geom_smooth(formula = y ~ x, method = "lm", se = FALSE),
    list(XVAR = .x, YVAR = .y, DF = .df)
  ))
}

lm_tidy_str(mtcars, "cyl", "mpg")

# Call:
# lm(formula = mpg ~ cyl, data = mtcars)
# 
# Coefficients:
# (Intercept)          cyl  
#      37.885       -2.876  
# 

require(ggplot2)
plot_gg_str(mtcars, "cyl", "mpg")



Gwang-Jin Kim
  • The `plot_gg` function does work as it is, say `anscombe %>% plot_gg(x2, y2)`. The problem is: when I tried to use the same mechanism for the formula components like in `lm_tidy_1` (my first try), it does not work any more. – green diod May 26 '22 at 12:26
  • By the way, thanks for the pure base R solution but I don't mind extra-dependencies on dplyr/rlang because I would use their facilities elsewhere anyway. Another remark: your solution works for my intended use case `anscombe %>% lm_tidy(x1, y1)` while `anscombe %>% lm_tidy("x1", "y1")` results in an error. And still, it seems close to @onyambu 's solution which works in both cases. – green diod May 26 '22 at 12:42
  • yes, it doesn't work if you give the `x1` and `y1` as strings - by intention. you save 4 key strokes - or do you want them to be modifiable as strings? – Gwang-Jin Kim May 26 '22 at 12:47
  • @greendiod what you tried exactly with `lm_tidy_1`? – Gwang-Jin Kim May 26 '22 at 12:48
  • @greendiod do you need to work with strings also? You seem to say you only need symbols yet your comments suggest you also need strings. – Onyambu May 26 '22 at 12:49
  • @Gwang-JinKim @onyambu I'd rather use plain symbols and only resort to string names as a last resort. I asked to know more about why it didn't seem to work the same way as @onyambu's answer when on the outside, both solutions used `substitute`. – green diod May 26 '22 at 12:56
  • @greendiod `substitute` takes in symbols and strings. Reformulate works with both of them while `~` does not. Thats why the two are different. Also note that `as.name` is equivalent to `as.symbol` – Onyambu May 26 '22 at 12:58
  • @greendiod I used `substitute()` in multiple steps and here also with slightly different intentions. @onyambu's `substitute(x)` I use at beginning too - to take literally the parameters of the functions without evaluation first. But later in the function, I use substitute to exchange expression components in a given expression. – Gwang-Jin Kim May 26 '22 at 12:59
  • @greendiod - the advantage of taking x and y as strings is that one can pass expressions for them, whereas with the non-string capture the arguments are not evaluated. So one can do things like `lm_tidy_str(mtcars, names(mtcars)[2], names(mtcars)[1])`, while this is not possible with `lm_tidy(mtcars, names(mtcars)[2], names(mtcars)[1])`. – Gwang-Jin Kim May 26 '22 at 13:04
  • As I said, most of the time, the only thing that's needed is the data-variable symbol as function argument. But I'll keep in mind the extra flexibility of passing strings. – green diod May 26 '22 at 13:10
  • @greendiod - @onyambu's method however prints `Call: lm(formula = fm, data = df)` so you don't see the actual formula which was used. It should show `iris` instead of `df` and a formula using `Species` and `Sepal.Length` instead of `fm`. This detail seems to be of little importance - but it becomes important when you use plotting functions. My `plot_gg` and `plot_gg_str` label the x and y axis correctly - for automatic labeling it is essential. Later it makes the difference like whether you have to manually relabel your plots or not - meaning a lot of work time savings. – Gwang-Jin Kim May 26 '22 at 13:10
  • @Gwang-JinKim I agree with the part that one usually needs the actual formula. By the way, `?anscombe` uses an ugly string hack to update the `lm` formulas. What would be the solution if one can use dplyr/rlang? I wanted to use dplyr facilities but what worked pretty simply for `plot_gg` and `ggplot` did not for `lm_tidy` and `lm`. – green diod May 26 '22 at 13:17
  • @greendiod dplyr/rlang tries to make the control more fine-grained than base R. The problem however is that R's syntax is already quite f**ked up, and everything quickly looks complicated and messy. dplyr uses rlang. rlang tries to get more control by imitating Lisp languages (which have a more regular syntax). – Gwang-Jin Kim May 26 '22 at 13:20
  • Good thoughts, everyone. @green diod, there is a lot of information on this topic at: https://adv-r.hadley.nz/quasiquotation.html, including the correspondence between base R and the rlang/dplyr approaches. Hadley states that the advantage of using rlang is 1. the naming scheme is more consistent and 2. it allows for unquoting. I like just working with arguments as strings because I can just pass a vector of names to iterate, like lapply(names(mtcars)[-1], function(x) lm_tidy(mtcars, x, "mpg")) – Brian Syzdek May 26 '22 at 14:19
1

Wrap the formula in "expr," then evaluate it.

library(dplyr)
lm_tidy <- function(df, x, y) {
  x <- sym(x)
  y <- sym(y)
  fm <- expr(!!y ~ !!x)
  lm(fm, data = df)
}

This function is equivalent:

lm_tidy <- function(df, x, y) {
  fm <- expr(!!sym(y) ~ !!sym(x))
  lm(fm, data = df)
}

Then

lm_tidy(mtcars, "cyl", "mpg")

gives

Call:
lm(formula = fm, data = .)

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  

EDIT per comment below:

library(rlang)
lm_tidy_quo <- function(df, x, y){
    y <- enquo(y)
    x <- enquo(x)
    fm <- paste(quo_text(y), "~", quo_text(x))
    lm(fm, data = df)
}

You can then pass symbols as arguments

lm_tidy_quo(mtcars, cyl, mpg)
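
If the goal is to pass bare symbols and still hand lm() a real formula object rather than a string, one possible variant is sketched below. `!!` and `{{ }}` are only interpreted by functions that support quasiquotation, which `~` and lm() do not, so the formula is assembled with rlang's ensym() and new_formula() instead. The name lm_tidy_sym is made up for this illustration, and the step that patches the stored call is optional:

```r
library(rlang)

# Hypothetical name; a sketch, not a definitive implementation.
lm_tidy_sym <- function(df, x, y) {
  # ensym() captures each argument as a symbol (a string is converted too)
  fm <- new_formula(ensym(y), ensym(x))
  fit <- lm(fm, data = df)
  # optional: store the actual formula in the call instead of the name `fm`
  fit$call$formula <- fm
  fit
}

fit <- lm_tidy_sym(mtcars, cyl, mpg)
formula(fit)
```

The same function also accepts lm_tidy_sym(mtcars, "cyl", "mpg"), because ensym() turns strings into symbols.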
Brian Syzdek
  • Your solution works for my original `anscombe %>% lm_tidy("x1", "y1")` use case but not for my intended use case `anscombe %>% lm_tidy(x1, y1)`. What is the difference between `sym` and `enquo`? – green diod May 26 '22 at 12:50
  • @greendiod `sym` is `rlang`/`dplyr`'s way to make symbols out of strings. In base R you use `as.name()` instead. `enquo()` is the equivalent to `substitute()` to take arguments as they are without evaluation. It comes from 'enquote'. – Gwang-Jin Kim May 26 '22 at 13:16
  • When I tried `anscombe %>% lm_tidy_enquo("x1", "y1")` with all `sym` calls replaced by `enquo`, I got an invalid model formula error. So if `enquo` is equivalent to `substitute`, what is the dplyr/rlang equivalent to `eval`? – green diod May 26 '22 at 13:24
  • Unquotation is discussed in detail here: https://adv-r.hadley.nz/evaluation.html, including using eval. Also discussed is how we can display the specified model in the function call by using expr_print(fm), so we could address @Gwang-jin Kim's objection #1 with following: lm_tidy <- function(df, x, y) { fm <- expr(!!sym(y) ~ !!sym(x)) rlang::expr_print(fm) lm(fm, data = df)} – Brian Syzdek May 26 '22 at 14:38
  • @BrianSyzdek Would you mind editing your answer with an extra part using data-variables/symbols instead of column names as strings? – green diod May 26 '22 at 15:11
  • See edit. I prefer the approach I originally stated because you can easily pass vectors of strings as predictors, which is probably your point of writing the function. BTW, I now prefer to pivot_long data and group by response variable names then group_map each set of nested data to lm: https://dplyr.tidyverse.org/reference/group_map.html, rather than using the quasiquotation for models – Brian Syzdek May 26 '22 at 15:54
  • As I commented earlier, I want to pass symbols as arguments as done with most of the tidyverse functions (data masking in https://dplyr.tidyverse.org/articles/programming.html which is related to the other references you gave). But the case of passing data variables to `ggplot` calls was addressed explicitly in those references. So as per https://ggplot2.tidyverse.org/articles/ggplot2-in-packages.html, I would now write `plot_gg <- function(df, x, y) { ggplot(df, aes(x = {{ x }}, y = {{ y }})) + geom_point() + geom_smooth(formula = y ~ x, method="lm", se = FALSE) }`. – green diod May 26 '22 at 17:38
  • I'll check `group_map`but for now, I'll stick with the `paste` approach unless someone shows a better way with quasiquotations. – green diod May 26 '22 at 17:40
  • You could change your ggplot data to `ggplot(df, aes(x = !! sym(x), y = !! sym(y)))`. You can then pass character vectors to arguments iteratively to generate multiple plots, like `lapply(names(mtcars)[-1], function(x) plot_gg(mtcars, x, "mpg"))`. – Brian Syzdek May 26 '22 at 17:54
  • Again, I 'd rather stay with symbols as arguments to stay consistent with the rest of the tidyverse. Now, I assume there's a way to use `lapply` with symbols ... – green diod May 27 '22 at 06:20
1

Here's what I use:

lm_tidy <- function(df, x, y) {
  fm <- as.formula(paste0(y, ' ~ ', x))
  lm(fm, data=df)
}

See:

?as.formula