
I have an R pipeline to run analyses on a big dataset. Currently I start an analysis by calling my script from the terminal and passing it my analysis parameters:

    $ ./my_script.R --parameter1 a1 --parameter2 b1

The script loads the dataset from a .Rds file, but loading takes more than a minute every time I start the script.

Is there a way to keep the dataset in memory to run multiple analyses in a row (meaning $ ./my_script.R --parameter1 a2 --parameter2 b2 etc.)? Using the global environment maybe?
Thanks!

Pierre
  • You're saying that on the bash prompt, you call `./my_script.R...` and when it completes, you are back on the bash prompt and want to run the same (or a different) R script and have that data stay resident for the second R instance to reuse? – r2evans Oct 09 '20 at 21:37
  • @r2evans yes that's it. – Pierre Oct 09 '20 at 21:52
  • Here's the problem with that: when `myscript.R` is complete, R exits. The memory that it had requested and allocated for the purposes of its calculations (and holding the contents of the `.Rds` file) has since been deallocated and returned to the OS. If you want/need R objects to stay resident in memory, you must keep R active somehow, and that may not be compatible with this method of multiple (distinct) script files. – r2evans Oct 09 '20 at 22:00
  • One way to keep some form of R with resident memory is to provide some form of R *service*, which could be `shiny` (not likely good for this scenario), `plumber`, or `Rserve`. There are pros and cons for each, and they all have baggage (e.g., learning curve, overhead of a running process, configuration, authentication, state-management, etc). – r2evans Oct 09 '20 at 22:02
  • One way you *might* be able to improve your overall process is this: instead of providing `--param1 a1 --param2 b1`, allow for more than one iteration in a single call. For instance, you might support `--param1`-like args (no change from current capability), but also add an option `--params-csv somefile.csv`, where the CSV file is a 2-column csv with 1 or more pairs of `a1,b1`, one execution per line. Your script would need to be adjusted to accommodate this iteration, but you would be able to keep your large-ish `.Rds` resident in memory after loading it *once*. – r2evans Oct 09 '20 at 22:04
  • Thank you for your answers! plumber and Rserve look like good resources but probably overkill for this, given the learning curve too. Your last suggestion is excellent! I'll see how I can implement that. Thanks again! – Pierre Oct 09 '20 at 22:12
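
To make the "R as a service" idea from the comments concrete, here is a minimal plumber sketch (the file name, endpoint, and analysis body are hypothetical, not something discussed above); the point is that the .Rds is read once when the API starts and then stays resident for every request:

# plumber_api.R -- hypothetical sketch of serving the resident dataset
library(plumber)

big_data <- readRDS("big_dataset.Rds")   # loaded once, when the API starts

#* Run one analysis against the resident dataset
#* @param parameter1 first analysis parameter
#* @param parameter2 second analysis parameter
#* @get /analyze
function(parameter1 = "", parameter2 = "") {
  # placeholder: report dataset size together with the requested parameters
  list(rows = nrow(big_data), parameter1 = parameter1, parameter2 = parameter2)
}

Started once with plumber::plumb("plumber_api.R")$run(port = 8000), each analysis becomes an HTTP call (e.g. curl 'localhost:8000/analyze?parameter1=a1&parameter2=b1') with no reload of the .Rds in between.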

1 Answer


One way to attack this problem is to allow the user to specify multiple pairs of arguments in a single script call, so that the program can iterate over all of them at once and incur the startup cost only once.

Here's a sample script that uses a few things:

  1. library(optparse), for easier argument handling. Other packages exist and none is strictly required; I just find it keeps things simple.
  2. The ability for the script to know if it is being sourced (and not run some code, useful for dev/testing) or being run from the command line (which would trigger some code to run). This is similar to Python's if __name__ == '__main__': trick, something I answered a while ago at https://stackoverflow.com/a/47932989/3358272.
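
As a minimal illustration of that check (the file name and function below are hypothetical): when the file is run with Rscript, sys.nframe() is 0 and the guarded block executes; when it is source()d from an interactive session, only the function definition is loaded.

# demo.R
hello <- function() {
  message("hello from a reusable function")
}

if (sys.nframe() == 0L) {
  # reached only when the file is run as a script, not when it is source()d
  hello()
}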

Neither of these is strictly necessary, but I find they help demonstrate how to structure the script so that you can support "one or more" type operations.

#!/usr/bin/env r
startup <- function() {
  message(Sys.time(), " Some expensive data load ...")
  Sys.sleep(3)
}

func1 <- function(x, y) {
  message(Sys.time(), " Called with (x,y): ", jsonlite::toJSON(list(x=x,y=y)))
}

if (sys.nframe() == 0L) {
  library(optparse)
  P <- OptionParser()
  P <- add_option(P, c("--param1"), dest = "p1", type = "character",
                  help = "Parameter 1", metavar = "P1")
  P <- add_option(P, c("--param2"), dest = "p2", type = "character",
                  help = "Parameter 2", metavar = "P2")
  P <- add_option(P, c("--param-csv"), dest = "pcsv", type = "character",
                  help = "CSV file with parameters in each column", metavar = "FILE")
  args <- parse_args(P, commandArgs(trailingOnly = TRUE))

  if (!is.null(args$pcsv)) {
    if (!file.exists(args$pcsv)) {
      stop("file not found: ", sQuote(args$pcsv))
    }
    params <- read.csv(args$pcsv, header = FALSE)
    if (ncol(params) < 2L) {
      stop("file does not have (at least) 2 columns")
    }
  } else {
    params <- data.frame(
      p1 = sapply(strsplit(args$p1, "[,[:space:]]+")[[1]], trimws),
      p2 = sapply(strsplit(args$p2, "[,[:space:]]+")[[1]], trimws)
    )
  }

  startup()

  for (rownum in seq_len(nrow(params))) {
    func1(params[[1]][rownum], params[[2]][rownum])
  }  
}

For the sake of this demo, startup is where you would load your .Rds file (it takes 3 seconds here), and func1 is the rest of whatever processing you might be doing. (As a general hint, I try to do as little work as possible within the sys.nframe() == 0L block, so that the functions defined above it can be used interactively or from the script. It's just one way to organize code.)
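
In your case, startup() would wrap readRDS() and func1() would take the loaded object as an argument. A minimal sketch, assuming a hypothetical file name, column name, and analysis body:

startup <- function(path = "big_dataset.Rds") {
  message(Sys.time(), " Loading ", path, " ...")
  readRDS(path)                              # the slow part, paid once per run
}

func1 <- function(dataset, x, y) {
  # placeholder analysis against the already-loaded dataset; 'group' is a
  # made-up column name standing in for whatever your parameters select
  selected <- dataset[dataset$group == x, , drop = FALSE]
  message(Sys.time(), " (", x, ", ", y, "): ", nrow(selected), " rows selected")
}

The main block would then do dat <- startup() once and call func1(dat, params[[1]][rownum], params[[2]][rownum]) inside the loop, so the .Rds is read a single time no matter how many parameter pairs are supplied.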

This script supports three modalities:

  • your default invocation

    $ Rscript 64287443.R --param1 foo1 --param2 bar1
    2020-10-09 15:33:48 Some expensive data load ...
    2020-10-09 15:33:51 Called with (x,y): {"x":["foo1"],"y":["bar1"]}
    

    one "job" at a time.

  • comma-separated multiple arguments, as in

    $ Rscript 64287443.R --param1 foo1,foo2 --param2 bar1,bar2
    2020-10-09 15:33:55 Some expensive data load ...
    2020-10-09 15:33:58 Called with (x,y): {"x":["foo1"],"y":["bar1"]}
    2020-10-09 15:33:58 Called with (x,y): {"x":["foo2"],"y":["bar2"]}
    

    which is equivalent to running

    $ Rscript 64287443.R --param1 foo1 --param2 bar1
    $ Rscript 64287443.R --param1 foo2 --param2 bar2
    

    except that it is only incurring the startup cost once.

  • a CSV file of jobs, one job per row and one parameter per column.

    $ cat params.csv
    foo1,bar1
    foo2,bar2
    foo3,bar3
    
    $ Rscript 64287443.R --param-csv params.csv
    2020-10-09 15:35:15 Some expensive data load ...
    2020-10-09 15:35:18 Called with (x,y): {"x":["foo1"],"y":["bar1"]}
    2020-10-09 15:35:18 Called with (x,y): {"x":["foo2"],"y":["bar2"]}
    2020-10-09 15:35:18 Called with (x,y): {"x":["foo3"],"y":["bar3"]}
    

TODO:

  • the logic to strsplit a comma-separated list for --param1 and --param2 is trusting: it should check for unequal pairings and either error out or do something meaningful; as of now, mismatched lengths will either fail in data.frame() or silently recycle values (a sketch of such a check follows this list)
  • in general, there is very little error checking here, but that's context-sensitive
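
A hypothetical tightening of the comma-separated branch, verifying that --param1 and --param2 expand to the same number of values before pairing them:

p1 <- trimws(strsplit(args$p1, "[,[:space:]]+")[[1]])
p2 <- trimws(strsplit(args$p2, "[,[:space:]]+")[[1]])
if (length(p1) != length(p2)) {
  # refuse to guess how unequal parameter lists should be paired up
  stop("--param1 and --param2 must have the same number of values (got ",
       length(p1), " and ", length(p2), ")")
}
params <- data.frame(p1 = p1, p2 = p2)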
r2evans