
Background/Motivation: I am running a bioinformatics pipeline that, if executed linearly from beginning to end, takes several days to finish. Fortunately, some of the tasks don't depend upon each other, so they can be performed individually. For example, Tasks 2, 3, and 4 all depend upon the output from Task 1, but do not need information from each other. Task 5 uses the output of Tasks 2, 3, and 4 as input.

I'm trying to write a script that will open new instances of R for each of the three tasks and run them simultaneously. Once all three are complete I can continue with the remaining pipeline.

What I've done in the past, for more linear workflows, is have one "master" script that sources (source()) each task's subscript in turn.

I've scoured SO and Google and haven't been able to find a solution to this particular problem. Hopefully you guys can help.

From within R, you can call system() to invoke shell commands, and on macOS the open command can be used to open a file or application. For example, the following will open a new Terminal instance:

system("open -a Terminal .",wait=FALSE)

Similarly, I can start a new R session by using

system("open -a r .")

What I can't figure out for the life of me is how to set the "input" argument so that it sources one of my scripts. For example, I would expect the following to open a new Terminal instance, call R within the new instance, and then source the script.

system("open -a Terminal .",wait=FALSE,input=paste0("r; source(\"/path/to/script/M_01-A.R\",verbose=TRUE,max.deparse.length=Inf)"))
– jrp355

2 Answers

Answering my own question in the event someone else is interested down the road.

After a couple of days of working on this, I think the best way to carry out this workflow is to not limit myself to working just in R. Writing a bash script offers more flexibility and is probably a more direct solution. The following example was suggested to me on another website.

#!/bin/bash

# Run task 1
Rscript Task1.R

# now run the three jobs that use Task1's output
# we can fork these using '&' to run in the background in parallel
Rscript Task2.R &
Rscript Task3.R &
Rscript Task4.R &

# wait until background processes have finished
wait %1 %2 %3

Rscript Task5.R
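
If you would rather stay entirely inside R, a rough equivalent of the same fork-and-wait pattern can be sketched with the base parallel package (fork-based, so Unix/macOS only; the Task*.R names are the same placeholders as above, and note that the forked children inherit the current workspace rather than starting fresh the way Rscript does):

library(parallel)

# Task 1 must finish before the dependent tasks start
source("Task1.R")

# fork the three independent tasks as background child processes
jobs <- list(
  mcparallel(source("Task2.R")),
  mcparallel(source("Task3.R")),
  mcparallel(source("Task4.R"))
)

# block until all three children have finished
mccollect(jobs)

source("Task5.R")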
– jrp355

You might be interested in the future package (I'm the author). It allows you to write your code as:

library("future")

v1 %<-% task1(args_1)

v2 %<-% task2(v1, args_2)
v3 %<-% task3(v1, args_3)
v4 %<-% task4(v1, args_4)

v5 %<-% task5(v2, v3, v4, args_5)

Each of those v %<-% expr statements creates a future based on the R expression expr (and all of its dependencies) and assigns it to a promise v. Only when v is first used will it block and wait for the value of v to become available.
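
As a small, self-contained illustration of that blocking behaviour (jumping ahead to the parallel backend described below; the expression is just a placeholder):

library("future")
plan(multisession)

x %<-% { Sys.sleep(3); 42 }   # returns immediately; evaluation happens in the background
message("future created, main session keeps going")
print(x)                      # the first use of x blocks until the value is ready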

How and where these futures are resolved is decided by the user of the above code. For instance, by specifying:

library("future")
plan(multiprocess)

at the top, then the futures (= the different tasks) are resolved in parallel on your local machine. If you use,

plan(cluster, workers = c("n1", "n3", "n3", "n5"))

they're resolved on those machines (where n3 accepts two concurrent jobs).

This works on all operating systems (including Windows).
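
To make the pattern concrete, here is a minimal, self-contained sketch; the task bodies and sleep times are placeholders invented for illustration, and plan(multisession) is the current name for the local parallel backend (the multiprocess alias used above also worked in older versions of future):

library("future")
plan(multisession)  # run futures in separate background R sessions

# placeholder tasks standing in for the real pipeline steps
task1 <- function()        { Sys.sleep(1); 10 }
task2 <- function(x)       { Sys.sleep(2); x + 1 }
task3 <- function(x)       { Sys.sleep(2); x + 2 }
task4 <- function(x)       { Sys.sleep(2); x + 3 }
task5 <- function(a, b, c) { a + b + c }

v1 %<-% task1()

# these three futures are evaluated concurrently
v2 %<-% task2(v1)
v3 %<-% task3(v1)
v4 %<-% task4(v1)

# creating this future blocks until v2, v3, and v4 have all been resolved
v5 %<-% task5(v2, v3, v4)
print(v5)  # ~3 s wall time with >= 3 workers, vs ~7 s run sequentially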

If you have access to an HPC cluster with a scheduler such as Slurm, SGE, or TORQUE / PBS, you can use the future.BatchJobs package, e.g.

plan(future.BatchJobs::batchjobs_torque)

PS. One reason for creating future was to be able to do large-scale bioinformatics in a parallel / distributed fashion.

– HenrikB
  • Thank you for bringing this possibility up. My workaround will work for the moment, but having a native R solution will help significantly with some upcoming projects (several HiSeq runs). I'll share your package with my group members. – jrp355 Mar 09 '17 at 18:17