
I have a number of R scripts that I would like to chain together using a UNIX-style pipeline. Each script would take as input a data frame and provide a data frame as output. For example, I am imagining something like this that would run in R's batch mode.

  cat raw-input.Rds | step1.R | step2.R | step3.R | step4.R > result.Rds

Any thoughts on how this could be done?

– Nick Allen

3 Answers


Writing executable scripts is not the hard part; the tricky bit is making the scripts read from files and/or pipes. I wrote a somewhat general function for that here: https://stackoverflow.com/a/15785789/1201032

Here is an example where the I/O takes the form of csv files:

Your step?.R files should look like this:

#!/usr/bin/Rscript

# Return a connection opened for reading, handling "-"/stdin,
# process-substitution fifos (/dev/fd/*), and plain file names.
OpenRead <- function(arg) {
   if (arg %in% c("-", "/dev/stdin")) {
      file("stdin", open = "r")
   } else if (grepl("^/dev/fd/", arg)) {
      fifo(arg, open = "r")
   } else {
      file(arg, open = "r")
   }
}

args  <- commandArgs(TRUE)   # e.g. "-" when reading from a pipe
file  <- args[1]
fh.in <- OpenRead(file)

df.in <- read.csv(fh.in)
close(fh.in)

# do something
df.out <- df.in

# print output
write.csv(df.out, file = stdout(), row.names = FALSE, quote = FALSE)

and your CSV input file (in.csv below) should look like this:

col1,col2
a,1
b,2

Now this should work:

cat in.csv | ./step1.R - | ./step2.R -

The - are annoying but necessary. Also make sure to run something like chmod +x ./step?.R to make your scripts executable. Finally, you could store them (without the extension) inside a directory that you add to your PATH, so you can run them like this:

cat in.csv | step1 - | step2 -
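
Side note: the question pipes .Rds files rather than CSV. An untested sketch of the same pattern using R's native serialization (note that saveRDS ignores its compress argument when given a connection, so the stream between steps is uncompressed):

#!/usr/bin/Rscript

# sketch: read one serialized data frame from stdin, write one to stdout
con.in <- gzcon(file("stdin", open = "rb"))  # gzcon decompresses a gzip
df.in  <- readRDS(con.in)                    # header (saveRDS's default for
close(con.in)                                # files) and should pass plain
                                             # streams through unchanged
# do something
df.out <- df.in

con.out <- file("stdout", open = "wb")
saveRDS(df.out, con.out)   # uncompressed: compress is ignored for connections
close(con.out)

This variant would run as cat raw-input.Rds | ./step1.R | ./step2.R > result.Rds, with no - arguments, since every step reads stdin directly.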
– flodel

Why on earth you would want to cram your workflow into pipes when you have the whole R environment available is beyond me.

Make a main.r containing the following:

source("step1.r")
source("step2.r")
source("step3.r")
source("step4.r")

That's it. You don't have to convert the output of each step into a serialised format; instead you can just leave all your R objects (datasets, fitted models, predicted values, lattice/ggplot graphics, etc) as they are, ready for the next step to process. If memory is a problem, you can rm any unneeded objects at the end of each step; alternatively, each step can work with an environment which it deletes when done, first exporting any required objects to the global environment.
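
For example, the environment trick might look like this (a sketch; step1.r and df.clean are illustrative names):

step.env <- new.env()
sys.source("step1.r", envir = step.env)  # suppose step1.r creates df.clean

# export what later steps need, then discard everything else from the step
assign("df.clean", get("df.clean", envir = step.env), envir = globalenv())
rm(step.env)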


If modular code is desired, you can recast your workflow as follows. Encapsulate the work done by each file into one or more functions. Then call these functions in your main.r with the appropriate arguments.

source("step1.r")  # defines step1_read_input, step1_f2
source("step2.r")  # defines step2_f2
source("step3.r")  # defines step3_f1, step3_f2, step3_f3
source("step4.r")  # defines step4_write_output

step1_read_input(...)
step1_f2(...)
# ...
step4_write_output(...)
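
The data frame then flows from function to function in memory, e.g. (a sketch, using the file names from the question):

df <- step1_read_input("raw-input.Rds")
df <- step1_f2(df)
# ... the step2 and step3 functions ...
step4_write_output(df, "result.Rds")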
– Hong Ooi
  • ...probably for the same reason you write functions. Scripts are like building blocks you write, so you can later combine them in any way you like. Also, scripts are not limited to R: they can be written in any language you like. And you get all the Unix goodies (e.g. `grep`) at your fingertips. – flodel Jul 30 '13 at 19:12
  • @flodel But these aren't generic scripts we're talking about. These are _R_ scripts, and they're executed in an environment that allows much richer objects than just a stream of bytes. Furthermore, R has many (too many?) functions that mimic OS-level utilities; if you need more, chances are that you shouldn't be using R for that purpose anyway. – Hong Ooi Jul 30 '13 at 19:15
  • Furthermore, if you want modular code in R, like functions, you use... functions. If `source`ing a bunch of function definitions isn't formal enough for you, you can put them into a package. – Hong Ooi Jul 30 '13 at 19:17
  • That's if all you know and use is R, from an interactive session. The fact that the OP is already writing scripts tells me he is beyond that simple usage. – flodel Jul 30 '13 at 19:21
  • @flodel ??? Interactive use has nothing to do with it. The point is that pipes are a suboptimal way of writing modular _R_ code. – Hong Ooi Jul 30 '13 at 19:26
  • The user wants to run `cat raw-input.Rds | step1.R | step2.R | step3.R | step4.R > result.Rds` from the Unix command line. An interface. That's what I would call *interactive*. If indeed everything is written in R, the OP is better off writing functions or a package and using an interactive session in place of the Unix command line. If all you use is R, then writing executable scripts seems useless. And combining them in a `main.R` script is far from flexible. That being said, OP is already in the business of writing scripts and I can't imagine that's by mistake. – flodel Jul 30 '13 at 19:57
  • @flodel Now I'm confused. Your first reference to "interactive" was, I assumed, in relation to doing things at the R prompt -- not from the OS shell. You know, how most people, as well as R itself, would define an interactive R session, per `?interactive`. And the very example given in the OP is of chaining _multiple R scripts_ together. Not a combination of R with perl, python, bash, etc. R scripts only. – Hong Ooi Jul 30 '13 at 20:03
  • Initially downvoted this for a few seconds, but then as I kept reading I saw the point…and unlike my first impression the author is clearly aware of the situation and alternatives; not just providing a naïve or amateur "global variables are way easier!" reaction. Upvoted! – natevw Oct 22 '15 at 19:30

You'll need to add a line at the top of each script to read in from stdin. Via this answer:

in_data <- readLines(file("stdin"), 1)  # n = 1 reads a single line

You'll also need to write the output of each script to stdout().
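
Putting the two together, each script could be a minimal sketch like this (CSV on the wire, as in the accepted answer):

#!/usr/bin/Rscript

# read the whole table from stdin; readLines(..., 1) would stop after one line
df <- read.csv(file("stdin"))

# ... transform df here ...

# write the result to stdout for the next script in the pipe
write.csv(df, file = stdout(), row.names = FALSE)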

– David Marx