
Possible Duplicate:
Workflow for statistical analysis and report writing

I have not been programming in R for long, but I am running into a project organization question that I was hoping somebody could give me some tips on. I am finding that a lot of the analysis I do is ad hoc: that is, I run something, think about the results, tweak it and run some more. This is conceptually different from a language like C++, where you think about the entire thing you want to run before coding. It is a huge benefit of interpreted languages. However, the issue that comes up is that I end up with a lot of .RData files that I save so I don't have to source my script every time. Does anyone have any good ideas about how to organize my project so I can return to it a month later and have a good idea of what each file is associated with?

This is sort of a documentation question, I guess. Should I document my entire project at each stage and be rigorous about cleaning up files that are no longer necessary but were a byproduct of the research? This is my current system, but it is a bit cumbersome. Does anyone have any other suggestions?

Per the comment below: One of the key things that I am trying to avoid is the proliferation of .R analysis files and .RData sets that go along with them.

Alex
  • Have you read this yet? http://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing – Josh O'Brien Oct 23 '12 at 18:07
  • nope, hadn't run across it. let me read! thanks – Alex Oct 23 '12 at 18:08
  • 2 cents: No matter what, every function you write should include documentation. You'll thank yourself for it later. Your cleaned data should also be serialized (saveRDS) for ease of use in the future. Everything else should be a function, or a line by line analysis. There's no right, wrong, or "best" here - there's only preference. With that said, if you're going to go through the trouble of writing documentation - you might as well implement change management (git, others) and packaging of your analysis. – Brandon Bertelsen Oct 23 '12 at 19:50
  • I read the documentation for saveRDS but I can't understand how that is different than save. Is the only difference that the name of the object can be different when you restore it in the saveRDS case? – Alex Oct 23 '12 at 19:55
  • One works with objects (RDS); the other works with environments (save/load). – Brandon Bertelsen Oct 23 '12 at 21:22
  • I know this is an old question and marked as duplicate, but this comment seems better here than elsewhere: you might consider open-source Visual Studio as it all but forces you to organize your work into projects and solutions (groups of projects). They introduced R support on 2016-03-09: https://blogs.technet.microsoft.com/machinelearning/2016/03/09/announcing-r-tools-for-visual-studio-2/ – johnjps111 Mar 21 '16 at 16:13
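
The saveRDS versus save distinction raised in the comments above can be illustrated with a minimal sketch. The data frame, object names, and file paths below are hypothetical and only serve the example; they are not from the original thread.

```r
# Hypothetical example data; object and file names are illustrative only.
cleaned <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
rds_path   <- file.path(tempdir(), "cleaned.rds")
rdata_path <- file.path(tempdir(), "cleaned.RData")

# saveRDS() writes a single object without its name; readRDS() returns it,
# so the caller picks the name when restoring.
saveRDS(cleaned, rds_path)
restored <- readRDS(rds_path)

# save() stores objects together with their names; load() recreates them
# under those same names in the target environment.
save(cleaned, file = rdata_path)
env <- new.env()
loaded_names <- load(rdata_path, envir = env)
```

Here `restored` holds the data under a name chosen at read time, while `load()` puts an object called `cleaned` into `env` and returns the names it restored. Loading into a separate environment, as sketched, avoids silently overwriting an existing object with the same name.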

2 Answers


Some thoughts on research project organisation here:

http://software-carpentry.org/4_0/data/mgmt/

the take-home message being:

  • Use Version Control for your programs
  • Use sensible directory names
  • Use Version Control for your metadata
  • Really, Version Control is a good thing.
Spacedman

My analysis is a knitr document, with some external .R files which are called from it.

All data is in a database, but during my analysis the processed data are saved as .RData files. They are recreated from the database only if I delete the .RData files and run the analysis again. This works like a cache: it saves database access and data processing time when I rerun (parts of) my analysis.
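
A minimal sketch of this delete-to-rebuild cache pattern might look like the following. It is an assumption-laden illustration, not the setup described above: the helper `cached()` and the stand-in `load_processed()` are hypothetical names, a small data frame replaces the real database query, and `saveRDS()`/`readRDS()` are used instead of .RData files to keep the helper short.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise run compute(), write its result to the cache, and return it.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Stand-in for the expensive database query and processing step.
n_runs <- 0
load_processed <- function() {
  n_runs <<- n_runs + 1  # count how often the expensive step really runs
  data.frame(id = 1:3, value = c(10, 20, 30))
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached(cache_file, load_processed)  # computes and writes the cache
second <- cached(cache_file, load_processed)  # reads the cache file
unlink(cache_file)                            # deleting the file forces a rebuild
third  <- cached(cache_file, load_processed)  # computes again
```

The expensive step runs only when the cache file is missing, so removing the file is the way to request fresh data.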

Using a knitr (Sweave, etc.) document for the analysis lets you easily write a documented workflow with the results included. knitr also caches the results of the analysis, so small changes usually do not trigger a full rerun of all the R code, only of a small section. This saves considerable running time for a bigger analysis.
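
As a rough sketch, knitr's chunk caching can be switched on for a whole document from a setup chunk. This assumes the knitr package is installed; the `requireNamespace()` guard is only there so the snippet also runs where it is not.

```r
# Typically placed in the first (setup) chunk of a knitr document.
# Assumes knitr is installed; the guard lets the snippet run without it.
if (requireNamespace("knitr", quietly = TRUE)) {
  # Cache the results of all subsequent chunks in this document.
  knitr::opts_chunk$set(cache = TRUE)
}
```

Caching can also be set per chunk with the `cache=TRUE` chunk option, which is useful when only a few chunks are slow.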

(Ah, and as said before: use version control. Another tip: working with knitr and version control is very easy with RStudio.)

ROLO