9

A very simple question:

I am writing and running my R scripts using a text editor to make them reproducible, as has been suggested by several members of SO.

This approach is working very well for me, but I sometimes have to perform expensive operations (e.g. read.csv or reshape on 2M-row databases) that I'd better cache in the R environment rather than re-run every time I run the script (which is usually many times as I progress and test the new lines of code).

Is there a way to cache what a script does up to a certain point so every time I am only running the incremental lines of code (just as I would do by running R interactively)?

Thanks.

smci
Roberto

7 Answers

10
## load the file from disk only if it 
## hasn't already been read into a variable
if (!exists("mytable")) {
  mytable <- read.csv(...)
}

Edit: fixed typo - thanks Dirk.
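A sketch of how this pattern extends to an on-disk cache so the slow parse survives across R sessions (the file names are placeholders, and `saveRDS`/`readRDS` stand in for whatever serialization you prefer; the demo `write.csv` is only there so the example runs end-to-end):

```r
csv_path <- "mytable.csv"  # placeholder raw-data file
rds_path <- "mytable.rds"  # placeholder cache file

## demo setup: write a small csv so the example is self-contained
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), csv_path, row.names = FALSE)

if (!exists("mytable")) {
  if (file.exists(rds_path)) {
    mytable <- readRDS(rds_path)   # fast path: load the cached object
  } else {
    mytable <- read.csv(csv_path)  # slow path: parse the raw file
    saveRDS(mytable, rds_path)     # cache it for the next run
  }
}
```

The in-memory `exists()` check short-circuits within a session; the `file.exists()` check covers fresh sessions.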

chrisamiller
  • thanks chris, but how do I make sure the table is kept into the workspace in TextMate (or another editor)? – Roberto Jul 27 '10 at 20:14
  • if you're running non-interactively, use the save.image(file="mydata.Rdata") command to save your workspace. Then load the workspace with load() at the beginning of each run. There's still going to be some grinding involved, as R needs to get all that data back into memory, but it'll save you the expensive computational steps. – chrisamiller Jul 27 '10 at 20:21
  • Also, consider leaving an R session open, editing your scripts in text mate, saving them, then loading the new code into R like so: source("~/pathto/myRscript.R") This way you don't have to reload data every time. Combine with some exists() statements and it'll speed things up considerably. – chrisamiller Jul 27 '10 at 20:24
  • `mytable` must be given as a character string. As posted, the code does not work. – Dirk Eddelbuettel Jul 27 '10 at 20:41
9

Some simple ways are doable with some combinations of

  • exists("foo") to test if a variable exists, else re-load or re-compute
  • file.info("foo.Rd")$ctime, which you can compare to Sys.time(): if the cached file is newer than a given age, load it, else recompute.

There are also caching packages on CRAN that may be useful.
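A minimal sketch combining both checks (the cache file name, the one-hour threshold, and the stand-in computation are all placeholders):

```r
cache_file <- "foo.RData"  # placeholder cache file
max_age    <- 60 * 60      # accept caches younger than one hour (seconds)

if (!exists("foo")) {
  fresh <- file.exists(cache_file) &&
    difftime(Sys.time(), file.info(cache_file)$ctime, units = "secs") < max_age
  if (fresh) {
    load(cache_file)                      # recent cache on disk: reuse it
  } else {
    foo <- sum(as.numeric(seq_len(1e6)))  # stand-in for the expensive step
    save(foo, file = cache_file)          # refresh the on-disk cache
  }
}
```

The `exists()` guard handles repeated runs within one session; the timestamp comparison decides whether a cache left on disk is still trustworthy.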

Dirk Eddelbuettel
  • Dirk, but does not every object have to be recreated anyway every time the script is re-run? So foo will never exist and always be recomputed, right? – Roberto Jul 27 '10 at 20:08
  • It depends. Sometimes one gets data from, say, a database which may be extensive. You could then cache this in a file and use the timestamp (as I described) to see whether you need a new db access or not. It all depends on the particulars of your situation. – Dirk Eddelbuettel Jul 27 '10 at 20:26
6

After you do something you discover to be costly, save the results of that costly step in an R data file.

For example, if you loaded a csv into a data frame called myVeryLargeDataFrame and then created summary stats from that data frame into a df called VLDFSummary then you could do this:

save(myVeryLargeDataFrame, VLDFSummary,
  file="~/myProject/cachedData/VLDF.RData",
  compress="bzip2")

The compress argument is optional; use it if you want the file on disk compressed. See ?save for more details.

After you save the RData file you can comment out the slow data loading and summary steps as well as the save step and simply load the data like this:

load("~/myProject/cachedData/VLDF.RData")

This answer is not editor dependent. It works the same for Emacs, TextMate, etc. You can save to any location on your computer. I recommend keeping the slow code in your R script file, however, so you can always know where your RData file came from and be able to recreate it from the source data if needed.
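The comment/uncomment cycle can also be automated with a file.exists() guard (a sketch: the cache path is a placeholder, and small stand-in objects replace the slow load and summary steps):

```r
cache <- "VLDF.RData"  # placeholder; e.g. ~/myProject/cachedData/VLDF.RData

if (file.exists(cache)) {
  load(cache)  # restores both objects in a single call
} else {
  ## stand-ins for the slow csv load and summary computation
  myVeryLargeDataFrame <- data.frame(x = rnorm(1000))
  VLDFSummary <- summary(myVeryLargeDataFrame$x)
  save(myVeryLargeDataFrame, VLDFSummary, file = cache)
}
```

Delete the .RData file whenever the source data changes, and the slow branch runs again.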

JD Long
5

(Belated answer, but I began using SO a year after this question was posted.)

This is the basic idea behind memoization (or memoisation). I've got a long list of suggestions, especially the memoise and R.cache packages, in this query.

You could also take advantage of checkpointing, which is also addressed as part of that same list.

I think your use case mirrors my second: "memoization of monstrous calculations". :)

Another trick I use is memory-mapped files to store data. The nice thing about this is that multiple R instances can access shared data, so I can have many instances cracking at the same problem.
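For a flavor of what memoization looks like, here is a hand-rolled base-R sketch (the memoise package on CRAN does this properly, with persistence options; the function names below are made up):

```r
## minimal memoiser: cache results of f keyed by its argument
memoize <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- as.character(x)
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)  # first call: compute and store
    }
    get(key, envir = cache)             # later calls: fetch from cache
  }
}

slow_square <- function(x) { Sys.sleep(0.1); x^2 }  # stand-in expensive fn
fast_square <- memoize(slow_square)

fast_square(12)  # 144, computed (slow)
fast_square(12)  # 144, cached (instant)
```

Real packages handle multiple arguments, hashing, and cache invalidation; this only shows the core idea.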

Iterator
  • I agree. Memoization/caching (with optional persistence) is the answer here, not ad hoc code. – Sim Jul 06 '12 at 03:03
3

I want to do this too when I'm using Sweave. I'd suggest putting all of your expensive functions (loading and reshaping data) at the beginning of your code. Run that code, then save the workspace. Then, comment out the expensive functions, and load the workspace file with load(). This is, of course, riskier if you make unwanted changes to the workspace file, but in that event, you still have the code in comments if you want to start over from scratch.

JoFrhwld
3

Without going into too much detail, I usually follow one of three approaches:

  1. Use assign to assign a unique name for each important object throughout my execution. Then include an if(exists(...)) get(...) at the top of each function to get the value or else recompute it. (same as Dirk's suggestion)
  2. Use cacheSweave with my Sweave documents. This does all the work for you of caching computations and retrieves them automatically. It's really trivial to use: just use the cacheSweave driver and add this flag to each block: <<..., cache=true>>=
  3. Use save and load to save the environment at crucial moments, again making sure that all names are unique.
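Approach 1 can be wrapped into a tiny helper (a sketch; `get_or_compute`, the `expensive_model_fit` name, and the `lm` call are all placeholders standing in for a real expensive step):

```r
## compute-once guard: evaluate expr only if `name` is not bound yet
get_or_compute <- function(name, expr) {
  if (!exists(name, envir = .GlobalEnv)) {
    assign(name, expr, envir = .GlobalEnv)  # forces expr on first call only
  }
  get(name, envir = .GlobalEnv)
}

fit <- get_or_compute("expensive_model_fit",
                      lm(mpg ~ wt, data = mtcars))  # stand-in slow step
```

Because R arguments are lazy promises, `expr` is never evaluated when the name already exists, so repeated runs of the script skip the expensive call entirely.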
Shane
-1

The 'mustashe' package is great for this kind of problem. In addition to caching the results, it can also track dependencies so that the code is re-run if a dependency changes.

Disclosure: I wrote this tool ('mustashe'), though I do not make any financial gains from others using it. I made it for this exact purpose for my own work and want to share it with others.

Below is a simple example. The foo variable is created and "stashed" for later. If the same code is re-run, the foo variable is loaded from disk and added to the global environment.

library(mustashe)

stash("foo", {
    foo <- some_long_running_operation(1e3)
})
#> Stashing object.

The documentation has additional examples of more complex use-cases and a detailed explanation of how it works under the hood.

jhrcook
  • You should be sure to read Stack Overflow’s self-promotion rules. – Jeremy Caney Jun 22 '20 at 01:58
  • Your profile indicates you're associated with the sites you have linked. Linking to something you're affiliated with (e.g. a library, tool, product, or website) **without disclosing it's yours** is considered spam on Stack Overflow. See: [What signifies "Good" self promotion?](//meta.stackexchange.com/q/182212), [some tips and advice about self-promotion](/help/promotion), [What is the exact definition of "spam" for Stack Overflow?](//meta.stackoverflow.com/q/260638), and [What makes something spam](//meta.stackexchange.com/a/58035). – Samuel Liew Jun 22 '20 at 03:55
  • When you have edited your post to fix the issue(s) mentioned above, you can flag for a moderator to review & undelete. – Samuel Liew Jun 22 '20 at 03:55