Assign intermediate output to temp variable as part of dplyr pipeline

Question

Q: In an R dplyr pipeline, how can I assign some intermediate output to a temp variable for use further down the pipeline?

My approach below works, but it assigns into the global environment, which is undesirable. There has to be a better way, right? I expected the commented-out line to give the desired result, but it doesn't, and I'm not sure why.

library(dplyr)

df <- data.frame(a = LETTERS[1:3], b = 1:3)
df %>%
  filter(b < 3) %>%
  assign("tmp", ., envir = .GlobalEnv) %>% # works
  #assign("tmp", .) %>% # doesn't work
  mutate(b = b*2) %>%
  bind_rows(tmp)
##   a b
## 1 A 2
## 2 B 4
## 3 A 1
## 4 B 2

You might like [`pipeR`](https://github.com/renkun-ken/pipeR), which [can assign (and a lot more) in the middle of a pipeline](https://renkun.me/pipeR-tutorial/Pipe-operator/Pipe-with-assignment.html), though it can get a bit hieroglyphic if you plan on sharing your code with anyone. — alistaire, Nov 01 '16 at 23:39
This is no better than your example but the syntax is arguably a bit nicer: `df %>% filter(b < 3) %>% { . ->> tmp } %>% mutate(b = b*2) %>% bind_rows(tmp)` — G. Grothendieck, Nov 01 '16 at 23:47
This is a strong code smell, a sign that you shouldn't be doing it. Tell us ***why*** you want to save the temporary filtered result `tmp`, i.e. what are you ultimately trying to achieve with your second pipeline? What's the problem if you don't save tmp and just repeat the `filter()` step? – smci, Nov 02 '16 at 00:05
@smci As I mentioned below, setting up two separate pipes is basically what I've been doing. It's not a huge problem. It just doesn't look nice, and I thought there might be a best practice of which I wasn't aware. Seems not. – lowndrul, Nov 02 '16 at 01:47
fwiw, I found this page because I am interested in saving temporary results mid-way in a pipe for debugging in Rstudio. If something is going wrong in my pipeline, it is nice to be able to store temporary results and then interact with them via the console. — teichert, Nov 30 '17 at 18:18
Btw, @lowndrul, the reason `assign("tmp", .) %>%` doesn't work is that the default 'envir' argument for `assign()` is the "current environment" which is different at each stage of the pipeline. To see it, try inserting `{ print(environment()); . } %>%` into the pipeline at various points and see that a different address is printed each time. — teichert, Nov 30 '17 at 18:28

G. Grothendieck · Accepted Answer · 2018-08-20T16:28:33.730

score 31

This does not create an object in the global environment:

df %>% 
   filter(b < 3) %>% 
   { 
     { . -> tmp } %>% 
     mutate(b = b*2) %>% 
     bind_rows(tmp) 
   }

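This can also be used for debugging: use . ->> tmp instead of . -> tmp so that tmp is still available after the pipeline has finished (the superassignment normally creates it in the global environment), or insert this into the pipeline to pause there: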

{ browser(); . } %>%
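
As a rough illustration of the debugging use described above, the sketch below runs the question's pipeline with `. ->> tmp` at the top level of a script and inspects the saved value afterwards. The name `result` is only for illustration. It assumes that no enclosing environment already holds a `tmp`, in which case the superassignment creates it in the global environment.

```r
library(dplyr)

df <- data.frame(a = LETTERS[1:3], b = 1:3)

result <- df %>%
  filter(b < 3) %>%
  { . ->> tmp } %>%      # save a copy of the filtered rows, then pass them on
  mutate(b = b * 2)

# tmp survives the pipeline, so it can be inspected from the console
str(tmp)
nrow(tmp)   # number of rows that passed the filter
```

Because the block's value is still the filtered data frame, the rest of the pipeline is unaffected by the extra step.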

answered Nov 02 '16 at 13:54 · edited Aug 20 '18 at 16:28

Why the right-side assignment -> instead of the traditional <-? I would think that keeping traditional left-hand assignment should minimize problems with readability with syntax that is already unfamiliar to many R readers. – Tripartio Oct 03 '18 at 10:13
It's just a matter of taste and the assignment works the same, but to me the right-side assignment seems to be kind of in the flow of the pipe :-) – hannes101 Nov 23 '18 at 14:06
Important caveat to this otherwise great solution: This only works on the original object `.` which is piped through to the RHS. So if you used `names(.) ->> tmp`, then `names(.)` would be piped through. – Agile Bean May 30 '21 at 11:34
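
The caveat in the comment above can be seen in a small sketch. The variable names (`passed_on`, `still_df`, `col_names_1`, `col_names_2`) are made up for the illustration. The second pipeline shows one possible workaround, which is to assign inside the braces and then return `.` explicitly.

```r
library(dplyr)

df <- data.frame(a = LETTERS[1:3], b = 1:3)

# Pitfall: the block's value is names(.), so the names flow down the pipe
passed_on <- df %>%
  filter(b < 3) %>%
  { names(.) ->> col_names_1 }
class(passed_on)   # a character vector, no longer a data frame

# One way around it: assign inside the block, then return `.` explicitly
still_df <- df %>%
  filter(b < 3) %>%
  { col_names_2 <<- names(.); . } %>%
  mutate(b = b * 2)
class(still_df)    # still a data frame
```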

score 19 · Answer 2 · answered Dec 19 '17 at 03:38

I often find the need to save an intermediate product in a pipeline. While my use case is typically to avoid duplicating filters for later splitting, manipulation and reassembly, the technique can work well here:

df %>%
  filter(b < 3) %>%
  {. ->> intermediateResult} %>%  # saves the intermediate result (normally into the global environment)
  mutate(b = b*2) %>%
  bind_rows(intermediateResult)

score 10 · Answer 3 · edited Sep 27 '18 at 16:58

pipeR is a package that extends the capabilities of the pipe without adding different pipes (as magrittr does). To assign, you pass a variable name, quoted with ~ in parentheses as an element in your pipe:

library(dplyr)
library(pipeR)

df %>>%
  filter(b < 3) %>>%
  (~tmp) %>>% 
  mutate(b = b*2) %>>%
  bind_rows(tmp)
##   a b
## 1 A 2
## 2 B 4
## 3 A 1
## 4 B 2

tmp
##   a b
## 1 A 1
## 2 B 2

While the syntax is not terribly descriptive, pipeR is very well documented.

answered Nov 01 '16 at 23:49 by alistaire · edited Sep 27 '18 at 16:58 by tyluRp

note that this approach doesn't seem to work with the original pipes in the question (i.e. `%>%`) – teichert Nov 30 '17 at 18:31
Right; `%>>%` is an extension of `%>%`. Technically you could override the magrittr pipe with the pipeR pipe without breaking anything, but it would make figuring out why the code is weird harder. – alistaire Nov 30 '17 at 18:59

eipi10 · Answer 4 · 2016-11-01T23:10:20.293

score 4

You can generate the desired object at the location in the pipeline where it's needed. For example:

df %>% filter(b < 3) %>% mutate(b = b*2) %>%
  bind_rows(df %>% filter(b < 3))

This method avoids having to filter twice:

df %>%
  filter(b < 3) %>%
  bind_rows(., mutate(., b = b*2))
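
For a longer pipeline, the same idea might be written with a parenthesized anonymous function, so that the function's argument plays the role of the temp variable. This is only a sketch. The argument name `kept` is arbitrary, and the rows are bound in the order that reproduces the output shown in the question.

```r
library(dplyr)

df <- data.frame(a = LETTERS[1:3], b = 1:3)

res <- df %>%
  filter(b < 3) %>%
  # `kept` is an ordinary function argument, so it exists only inside this step
  (function(kept) {
    kept %>%
      mutate(b = b * 2) %>%
      bind_rows(kept)
  })

res
```

Since `kept` is a function argument, nothing is written to the global environment, and several such steps can be nested if more than one intermediate result is needed.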

answered Nov 01 '16 at 22:56 · edited Nov 01 '16 at 23:10

That technically works for my toy example. But I think it's kind of infeasible for a more involved, long pipeline with multiple temp assignments – lowndrul Nov 01 '16 at 23:01
It would be helpful if you could provide a more complex use case as an example. I generally just create the desired object where it's needed in the pipeline. – eipi10 Nov 01 '16 at 23:03
Yeah, that's pretty much what I've been doing. But I thought there could be a more dplyr-ey way of going about it. Maybe not then. – lowndrul Nov 02 '16 at 00:25

teichert · Answer 5 · 2018-12-11T23:58:23.297

I was interested in this question for the sake of debugging: I wanted to save intermediate results so that I could inspect and manipulate them from the console without having to split the pipeline into two pieces, which is cumbersome. So, for my purposes, the only problem with the OP's original solution was that it was slightly verbose.

This can be fixed by defining a helper function:

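# Save the piped value under the (unquoted) name given in `...`, then pass it on.
# quos() and quo_name() come from rlang and are also available once dplyr is loaded.
to_var <- function(., ..., env=.GlobalEnv) {
  var_name = quo_name(quos(...)[[1]])  # turn the bare name into a string
  assign(var_name, ., envir=env)
  .  # return the input unchanged so the pipeline continues
}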

Which can then be used as follows:

df <- data.frame(a = LETTERS[1:3], b=1:3)
df %>%
  filter(b < 3) %>%
  to_var(tmp) %>%
  mutate(b = b*2) %>%
  bind_rows(tmp)
# tmp still exists here

That still uses the global environment, but you can also explicitly pass a more local environment as in the following example:

f <- function() {
    df <- data.frame(a = LETTERS[1:3], b=1:3)
    env = environment()
    df %>%
      filter(b < 3) %>%
      to_var(tmp, env=env) %>%
      mutate(b = b*2) %>%
      bind_rows(tmp)
}
f()
# tmp does not exist here

The problem with the accepted solution is that it didn't seem to work out of the box with tidyverse pipes. ~~G. Grothendieck's solution doesn't work for the debugging use case at all.~~ (update: see G. Grothendieck's comment below and his updated answer!)

Finally, the reason assign("tmp", .) %>% doesn't work is that the default 'envir' argument for assign() is the "current environment" (see documentation for assign) which is different at each stage of the pipeline. To see this, try inserting { print(environment()); . } %>% into the pipeline at various points and see that a different address is printed each time. (It is probably possible to tweak the definition of to_var so that the default is the grandparent environment instead.)
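
The point about the default environment of assign() can be seen without any pipe at all. The sketch below assumes, as described above, that a pipeline stage is evaluated in its own short-lived environment, and mimics that with an ordinary function call. The names `stage_default`, `stage_explicit`, `tmp_default`, `tmp_explicit` and `here` are made up for the illustration.

```r
# assign() with no envir argument writes into the environment it is called from.
stage_default <- function(value) {
  assign("tmp_default", value)   # lands in this function's own frame
  value
}

# Passing envir explicitly puts the value somewhere that outlives the call.
stage_explicit <- function(value, env) {
  assign("tmp_explicit", value, envir = env)
  value
}

here <- environment()            # the environment this script runs in
invisible(stage_default(1:3))
invisible(stage_explicit(1:3, here))

exists("tmp_default")            # FALSE: the frame vanished when the call returned
exists("tmp_explicit")           # TRUE: stored in `here`
```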

See the info I added at the end of my post on how to use it for debugging. — G. Grothendieck, Aug 20 '18 at 16:28
@G.Grothendieck Excellent. The `{ . ->> tmp } %>%` solves my problem and is much simpler. (I've crossed-out part of my answer and up-voted yours.) — teichert, Dec 12 '18 at 00:01
I used `sv=\(x,y="p")assign(y,x,,.GlobalEnv)`, so for example `5:6%>%sv%>%factorial%>%c(p)` returns `120 720 5 6`. — nisetama, Aug 18 '22 at 03:54

Fraser Hay · Answer 6 · 2022-07-20T19:47:44.037

score 0

Just tacking on a simple note to @teichert's good post: as long as you're operating inside a function call, you can get a reference to the function's environment with environment() and then use assign() to write the current state of the pipe into that environment, keeping it separate from the global environment.

f = function(df) {
  env = environment()
  df %>%
    # <actions here> %>% 
    assign("tmp", ., envir = env) %>% # Assign to function environment, not the pipe
    # <more actions> %>% 
    .[]
  # tmp is accessible here, within the function
}
# tmp does not exist here
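
One possible use case, sketched with the data frame from the question: a function that needs both the filtered rows and a transformed copy of them. The function name `split_and_combine` and the variable names `kept`, `doubled` and `out` are hypothetical.

```r
library(dplyr)

# Hypothetical helper: returns both the filtered rows and the combined result.
split_and_combine <- function(df) {
  env <- environment()                     # this function's own environment
  doubled <- df %>%
    filter(b < 3) %>%
    assign("kept", ., envir = env) %>%     # assign() passes its value on
    mutate(b = b * 2)
  # `kept` is visible here because it was assigned into `env`
  list(kept = kept, combined = bind_rows(doubled, kept))
}

out <- split_and_combine(data.frame(a = LETTERS[1:3], b = 1:3))
out$combined
exists("kept")   # FALSE when run in a fresh session: nothing leaked to the global environment
```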

answered Jul 07 '22 at 23:04 · edited Jul 20 '22 at 19:47

Not clear, give a use case – Julien Feb 13 '23 at 12:52
