29

Q: In an R dplyr pipeline, how can I assign some intermediate output to a temp variable for use further down the pipeline?

My approach below works. But it assigns into the global frame, which is undesirable. There has to be a better way, right? I figured my approach involving the commented line would get the desired results. No dice. Confused why that didn't work.

library(dplyr)

df <- data.frame(a = LETTERS[1:3], b = 1:3)
df %>%
  filter(b < 3) %>%
  assign("tmp", ., envir = .GlobalEnv) %>% # works
  #assign("tmp", .) %>% # doesn't work
  mutate(b = b*2) %>%
  bind_rows(tmp)
##   a b
## 1 A 2
## 2 B 4
## 3 A 1
## 4 B 2
smci
lowndrul
  • 7
    Just use 2 pipelines. This is needless obfuscation. – Hong Ooi Nov 01 '16 at 23:14
  • 1
    You might like [`pipeR`](https://github.com/renkun-ken/pipeR), which [can assign (and a lot more) in the middle of a pipeline](https://renkun.me/pipeR-tutorial/Pipe-operator/Pipe-with-assignment.html), though it can get a bit hieroglyphic if you plan on sharing your code with anyone. – alistaire Nov 01 '16 at 23:39
  • 7
    This is no better than your example but the syntax is arguably a bit nicer: `df %>% filter(b < 3) %>% { . ->> tmp } %>% mutate(b = b*2) %>% bind_rows(tmp)` – G. Grothendieck Nov 01 '16 at 23:47
  • This is a strong code smell that you shouldn't be doing it. Tell us ***why*** you want to save the temporary filtered result `tmp`, i.e. what are you ultimately trying to achieve with your second pipeline? What's the problem if you don't save tmp and just repeat the `filter()` step? – smci Nov 02 '16 at 00:05
  • @smci I mentioned below, setting up two separate pipes is basically what I've been doing. It's not huge problem. Just doesn't look nice and thought there might be best practice of which I wasn't aware. Seems not. – lowndrul Nov 02 '16 at 01:47
  • 1
    Ok so the consensus is "Don't do this, use two pipelines" – smci Nov 02 '16 at 08:24
  • fwiw, I found this page because I am interested in saving temporary results mid-way in a pipe for debugging in Rstudio. If something is going wrong in my pipeline, it is nice to be able to store temporary results and then interact with them via the console. – teichert Nov 30 '17 at 18:18
  • 3
    Btw, @lowndrul, the reason `assign("tmp", .) %>%` doesn't work is that the default 'envir' argument for `assign()` is the "current environment" which is different at each stage of the pipeline. To see it, try inserting `{ print(environment()); . } %>%` into the pipeline at various points and see that a different address is printed each time. – teichert Nov 30 '17 at 18:28
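The "use two pipelines" suggestion from the comments can be sketched as follows. This is a minimal illustration using the question's `df`; the names `filtered`, `result`, and the helper `double_and_stack()` are made up here, and wrapping the work in a function is just one assumed way to keep the intermediate out of the global environment.

```r
library(dplyr)

df <- data.frame(a = LETTERS[1:3], b = 1:3)

# Hypothetical helper: the intermediate lives only in the function's frame.
double_and_stack <- function(data) {
  # First pipeline: compute the intermediate result once.
  filtered <- data %>% filter(b < 3)

  # Second pipeline: transform it and append the untouched copy.
  filtered %>%
    mutate(b = b * 2) %>%
    bind_rows(filtered)
}

result <- double_and_stack(df)
```

After the call, `filtered` is gone along with the function's frame, so nothing extra is left behind in the calling scope.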

6 Answers

31

This does not create an object in the global environment:

df %>% 
   filter(b < 3) %>% 
   { 
     { . -> tmp } %>% 
     mutate(b = b*2) %>% 
     bind_rows(tmp) 
   }

This can also be used for debugging if you use `. ->> tmp` instead of `. -> tmp`, or insert this into the pipeline:

{ browser(); . } %>% 
G. Grothendieck
  • Why the right-side assignment -> instead of the traditional <-? I would think that keeping traditional left-hand assignment should minimize problems with readability with syntax that is already unfamiliar to many R readers. – Tripartio Oct 03 '18 at 10:13
  • 3
    It's just a matter of taste and the assignment works the same, but to me the right-side assignment seems to be kind of in the flow of the pipe :-) – hannes101 Nov 23 '18 at 14:06
  • Important caveat to this otherwise great solution: This only works on the original object `.` which is piped through to the RHS. So if you used `names(.) ->> tmp`, then `names(.)` would be piped through. – Agile Bean May 30 '21 at 11:34
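One way around the caveat in the last comment, sketched here as an assumption rather than something taken from the answer: do the assignment as its own statement inside the braces and end the block with `.`, so the original data frame is what gets piped on. The name `tmp_names` is made up for this example, and `->>` still writes to the global environment when run at top level.

```r
library(dplyr)

df <- data.frame(a = LETTERS[1:3], b = 1:3)

result <- df %>%
  filter(b < 3) %>%
  {
    names(.) ->> tmp_names  # save a derived value on the side
    .                       # pass the original data frame along
  } %>%
  mutate(b = b * 2)
```

Here `tmp_names` holds the column names, while `result` is still a data frame that went through `mutate()`.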
19

I often find the need to save an intermediate product in a pipeline. While my use case is typically to avoid duplicating filters for later splitting, manipulation and reassembly, the technique can work well here:

df %>%
  filter(b < 3) %>%
  {. ->> intermediateResult} %>%  # this saves intermediate 
  mutate(b = b*2) %>%
  bind_rows(intermediateResult)    
GGAnderson
10

pipeR is a package that extends the capabilities of the pipe without adding different pipes (as magrittr does). To assign, you pass a variable name, quoted with ~ in parentheses as an element in your pipe:

library(dplyr)
library(pipeR)

df %>>%
  filter(b < 3) %>>%
  (~tmp) %>>% 
  mutate(b = b*2) %>>%
  bind_rows(tmp)
##   a b
## 1 A 2
## 2 B 4
## 3 A 1
## 4 B 2

tmp
##   a b
## 1 A 1
## 2 B 2

While the syntax is not terribly descriptive, pipeR is very well documented.

tyluRp
alistaire
  • note that this approach doesn't seem to work with the original pipes in the question (i.e. `%>%`) – teichert Nov 30 '17 at 18:31
  • Right; `%>>%` is an extension of `%>%`. Technically you could override the magrittr pipe with the pipeR pipe without breaking anything, but it would make figuring out why the code is weird harder. – alistaire Nov 30 '17 at 18:59
4

You can generate the desired object at the location in the pipeline where it's needed. For example:

df %>% filter(b < 3) %>% mutate(b = b*2) %>%
  bind_rows(df %>% filter(b < 3))

This method avoids having to filter twice:

df %>%
  filter(b < 3) %>%
  bind_rows(., mutate(., b = b*2))
eipi10
  • 1
    That technically works for my toy example. But I think it's kind of infeasible for a more involved, long pipeline with multiple temp assignments – lowndrul Nov 01 '16 at 23:01
  • 2
    It would be helpful if you could provide a more complex use case as an example. I generally just create the desired object where it's needed in the pipeline. – eipi10 Nov 01 '16 at 23:03
  • Yeah, that's pretty much what I've been doing. But I thought there could be a more dplyr-ey way of going about it. Maybe not then. – lowndrul Nov 02 '16 at 00:25
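Note that the single-filter version in the answer stacks the unmodified rows first, whereas the question's output has the doubled rows first. A small variation, offered as a sketch rather than as part of the answer, swaps the arguments; because `.` appears as a direct argument, magrittr does not also insert it in first position.

```r
library(dplyr)

df <- data.frame(a = LETTERS[1:3], b = 1:3)

result <- df %>%
  filter(b < 3) %>%
  # Doubled rows first, then the filtered rows as they were.
  bind_rows(mutate(., b = b * 2), .)
```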
1

I was interested in the question for the sake of debugging (wanting to save intermediate results so that I can inspect and manipulate them from the console without having to separate the pipeline into two pieces, which is cumbersome). So, for my purposes, the only problem with the OP's original solution was that it was slightly verbose.

This can be fixed by defining a helper function:

to_var <- function(., ..., env=.GlobalEnv) {
  var_name = quo_name(quos(...)[[1]])
  assign(var_name, ., envir=env)
  .
}

Which can then be used as follows:

df <- data.frame(a = LETTERS[1:3], b=1:3)
df %>%
  filter(b < 3) %>%
  to_var(tmp) %>%
  mutate(b = b*2) %>%
  bind_rows(tmp)
# tmp still exists here

That still uses the global environment, but you can also explicitly pass a more local environment as in the following example:

f <- function() {
    df <- data.frame(a = LETTERS[1:3], b=1:3)
    env = environment()
    df %>%
      filter(b < 3) %>%
      to_var(tmp, env=env) %>%
      mutate(b = b*2) %>%
      bind_rows(tmp)
}
f()
# tmp does not exist here

The problem with the accepted solution is that it didn't seem to work out of the box with tidyverse pipes. G. Grothendieck's solution doesn't work for the debugging use case at all. (update: see G. Grothendieck's comment below and his updated answer!)

Finally, the reason `assign("tmp", .) %>%` doesn't work is that the default `envir` argument for `assign()` is the "current environment" (see the documentation for `assign`), which is different at each stage of the pipeline. To see this, try inserting `{ print(environment()); . } %>%` into the pipeline at various points and see that a different address is printed each time. (It is probably possible to tweak the definition of `to_var` so that the default is the grandparent environment instead.)

teichert
  • See the info I added at the end of my post on how to use it for debugging. – G. Grothendieck Aug 20 '18 at 16:28
  • @G.Grothendieck Excellent. The `{ . ->> tmp } %>%` solves my problem and is much simpler. (I've crossed-out part of my answer and up-voted yours.) – teichert Dec 12 '18 at 00:01
  • I used `sv=\(x,y="p")assign(y,x,,.GlobalEnv)`, so for example `5:6%>%sv%>%factorial%>%c(p)` returns `120 720 5 6`. – nisetama Aug 18 '22 at 03:54
0

Just tacking on a simple note to @teichert's good post: As long as you're operating inside a function call, you can get a reference to the function's environment with `environment()` and then use `assign()` to write the current state of the pipe into that environment, keeping it separate from the global one.

f = function(df) {
  env = environment()
  df %>%
    # <actions here> %>% 
    assign("tmp", ., envir = env) %>% # Assign to function environment, not the pipe
    # <more actions> %>% 
    .[]
  # tmp is accessible here, within the function
}
# tmp does not exist here
Fraser Hay
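The same idea also works without writing a named function. As a sketch (the use of `local()` is an addition here, not part of the answer above), the pipeline can run inside `local()`, whose environment is captured explicitly so the temporary never reaches the global environment:

```r
library(dplyr)

df <- data.frame(a = LETTERS[1:3], b = 1:3)

result <- local({
  env <- environment()  # the throwaway environment created by local()
  df %>%
    filter(b < 3) %>%
    assign("tmp", ., envir = env) %>%  # tmp is stored in env, not globally
    mutate(b = b * 2) %>%
    bind_rows(tmp)                     # tmp is found through the enclosing env
})
```

Only `result` survives the block; `tmp` is discarded together with the local environment.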