
I have what I think is a common enough issue with optimising workflow in R. Specifically, how can I avoid ending up with a folder full of output (plots, RData files, csv, etc.) and, after some time, no clue where the files came from or how they were produced? In part, the answer surely involves being intelligent about folder structure. I have been looking around, but I'm unsure what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function, MetaInfo (see below), that writes a text file of metadata under a given file name. The idea is that whenever a plot is produced, this command is issued to create a text file with exactly the same file name as the plot (except, of course, the extension), containing information on the system, session, packages loaded, R version, the function and file the metadata function was called from, and so on. The questions are:

(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?

(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. Particularly, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley in 1). Any ideas would be welcome!

The function assumes the project is under git, so please ignore the warning it will probably produce otherwise. This is the main function, stored in a file metainfo.R:

MetaInfo <- function(message=NULL, filename)
{
  # message  - character string - Any message to be written into the information
  #            file (e.g., data used).
  # filename - character string - the name of the txt file (including relative
  #            path). Should be the same as the output file it describes (RData,
  #            csv, pdf).
  #

  if (missing(filename) || is.null(filename))
  {
    stop('Provide an output filename - parameter filename.')
  }

  filename <- paste(filename, '.txt', sep='')

  # Try to get as close as possible to getting the file name from which the
  # function is called.
  source.file <- lapply(sys.frames(), function(x) x$ofile)
  source.file <- Filter(Negate(is.null), source.file)
  t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
              silent=TRUE)

  if (inherits(t.sf, 'try-error'))
  {
    source.file <- NULL
  }

  func <- deparse(sys.call(-1)) 

  # MetaInfo isn't always called from within another function, so func could
  # return as NULL or as general environment.
  if (any(grepl('eval', func, ignore.case=TRUE)))
  {
    func <- NULL
  }

  time    <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
  git.h   <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
  meta <- list(Message=message,
               Source=paste(source.file, ' on ', time, sep=''),
               Functions=func,
               System=Sys.info(),
               Session=sessionInfo(),
               Git.hash=git.h)
  sink(file=filename)
  on.exit(sink())  # restore normal console output even if printing fails
  print(meta)
}

which can then be called in another function, stored in another file, e.g.:

source('metainfo.R')

RandomPlot <- function(x, y)
{
  fn <- 'random_plot'
  pdf(file=paste(fn, '.pdf', sep=''))
  plot(x, y)
  MetaInfo(message=NULL, filename=fn)
  dev.off()
}

x <- 1:10
y <- runif(10)

RandomPlot(x, y)

This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.

tcam
  • I should also add that there is a good chance that Makefiles would be part of the solution. However, they are not always possible, because some analyses might require some 'manual' steps, such that you can't run R files in sequence. – tcam May 31 '13 at 19:47
  • Since you are considering Make, I added reference to Drake to my answer. – Alex Vorobiev May 31 '13 at 20:25
  • Thanks for the suggestion! Drake looks interesting. Still trying to get my head around it. I'm not particularly used to Make either, except in its most rudimentary forms. The part I struggle with is that Make makes sense once you have a clear-cut analysis and a Makefile to run the whole thing from beginning to end. The problem is before getting to that stage: during exploratory analysis, Make seems less well suited. But perhaps it induces good practice by forcing you to make the analysis as reproducible as possible. – tcam Jun 03 '13 at 09:18

5 Answers

In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.

So, to stay organized: keep a separate directory for each project; each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like

## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")

and so on, all the way through. This should help keep everything organized, and if you get new data you just go back to the early scripts to fix up headers or whatever and rerun the entire project with a single click.
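
For instance (the later script names here are placeholders of my own, not part of the answer), the rest of such a master script might continue along these lines:

source("fit-models.R")    # writes fitted model objects to output/
source("make-figures.R")  # writes every figure to figures/
source("write-tables.R")  # writes csv summaries to output/

If all of these lines live in one master script, say run-project.R, the whole project is rebuilt with a single source("run-project.R"), and every file under figures/ or output/ can be traced back to exactly one script in the project directory.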

Rorschach
  • This is true, and ideally it would be as simple as this. But often, before you get to a concise and clear analysis, you have a more open-ended stage where you wind up with several R files and loads of plots. That's the sort of stage you might perhaps delete once you have concise output, but if you don't, and you come back to it later, it can be very difficult to find your way around. – tcam May 31 '13 at 19:42

There is a package called Project Template that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, like this one: Workflow of statistical data analysis by Oliver Kirchkamp.
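
For what it's worth, ProjectTemplate's basic usage looks roughly like this (a sketch; the project name is arbitrary, and create.project() and load.project() are the package's two main entry points):

install.packages("ProjectTemplate")
library(ProjectTemplate)

create.project("my-analysis")  # sets up a standard layout: data/, munge/, graphs/, ...
setwd("my-analysis")
load.project()                 # reads the config, loads data/ and runs the munge/ scripts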

If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.

There is also this new free tool called Drake which is advertised as "make for data".

Alex Vorobiev
  • Thanks for the tips! I'll check out Project Template, although when I came across it I understood it as being more to do with basic setup of your project (folder layout, and so on), but perhaps it has evolved beyond this! As to ESS, alas I use vi. Maybe there's a vi-equivalent... I'll have a look. – tcam May 31 '13 at 20:15

I think my question betrays a certain level of confusion. Having looked around, and having explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file was produced. You should in fact be able to wipe out any output and reproduce it by rerunning the code. So while I might still use the above function for extra information, it really is a question of being ruthless and cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or Project Template, which I will try to pick up on. Thanks again for the suggestions @noah and @alex!
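
As a minimal sketch of that 'wipe and regenerate' idea (the directory and script names below are assumptions of mine, not a prescription):

## generated output is disposable; only code and raw data are kept
unlink("output", recursive=TRUE)          # delete everything previously produced
dir.create("output", showWarnings=FALSE)
source("analysis.R")                      # assumed to recreate all files under output/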

tcam

There is also now an R package called drake (Data Frames in R for Make), independent of Factual's Drake. The R package is likewise a Make-like build system that links code and its dependencies with output.

install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)

Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.

readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)

But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
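
Beyond the built-in example, a custom workflow is declared with drake_plan() and built with make(). A rough sketch (the input file and target names here are made up):

library(drake)

my_plan <- drake_plan(
  raw  = read.csv("data.csv"),   # hypothetical input file
  summ = summary(raw),
  fig  = {
    pdf("scatter.pdf")
    plot(raw)
    dev.off()
  }
)

make(my_plan)  # only out-of-date targets are rebuilt on later calls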

landau

There is a package developed by RStudio called pins that might address this problem.
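
A minimal sketch of how pins might be used for this (assuming the pins >= 1.0 API; the board type and pin name are arbitrary):

library(pins)

board <- board_local()                  # a pin board kept on local disk
pin_write(board, mtcars, "my-results")  # store an object under a stable name
pin_read(board, "my-results")           # retrieve it later by name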

hnagaty