
I would like to use R objects (e.g., cleaned data) generated in one git-versioned R project in another git-versioned R project.

Specifically, I have multiple git-versioned R projects (that hold drake plans) that do various things for my thesis experiments (e.g., generate materials, import and clean data, generate reports/articles).

The experiment-specific projects should ideally be:

  1. Connectable - so that I can get objects (mainly data and materials) that I generated in these projects into another git-versioned R project that generates my thesis report.
  2. Self-contained - so that I can use them in other non-thesis projects (such as presentations, reports, and journal manuscripts). When sharing such projects, I'd ideally like not to need to share a monolithic thesis project.
  3. Versioned - so that their use in different projects can be independent (e.g., if I make changes to the data cleaning for a manuscript after submitting the thesis, I still want the thesis to be reproducible as it was originally compiled).

At the moment I can see three ways of doing this:

  1. Re-create the data cleaning process
    • But: this involves copy/paste, which I'd like to avoid, especially if things change upstream.
  2. Access the relevant scripts/functions by changing the working directory
    • But: even if I used `here`, it seems that this would hurt reproducibility.
  3. Make the source projects into packages and turn the objects I want to "export" into exported data (as per the data section of Hadley's R Packages guide)
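
For option 3, the export step itself is small. A minimal sketch using `usethis` (`cleaned_data` and `clean_raw_data()` are placeholder names, not part of my actual projects):

# Inside the source project, once it is structured as a package:
cleaned_data <- clean_raw_data()   # hypothetical cleaning step
usethis::use_data(cleaned_data)    # saves data/cleaned_data.rda as exported data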

Is there any other way of doing this?

Edit: I tried @landau's suggestion of using a single drake plan, which worked well for a while until (similar to @vrognas' case) I ended up with too many sub-projects (e.g., conference presentations and manuscripts) relying on the same objects. I have therefore added some clarifications above about my intentions with the question.

shir
  • If the project itself is to be kept 'as-is', it seems like you could use `git` (which integrates nicely with RStudio) and simply clone the entire project. If the project has to change a lot, it could be cloned and altered on each front, either into new branches (not recommended) or new repositories. – Oliver Aug 20 '20 at 07:14
  • Are you searching for `save()` and `load()`? `save` writes an external representation of R objects to the specified file - usually .RData extension. – ismirsehregal Aug 20 '20 at 07:56
  • @Oliver I'd like to be able to make changes in one project (e.g., change a variable name in the data) and have those changes become available in the data used in another project as automatically as possible. Wouldn't cloning create independent instantiations of the project (and therefore of the data generated by it)? – shir Aug 20 '20 at 22:34
  • @ismirsehregal I know about those, but I want a way of accessing the data (e.g., as .RData) from another project, other than specifying a (non-relative) path. – shir Aug 20 '20 at 22:38
  • @shir Indeed, it would create independent instantiations of the dataset(s), project files, etc. in the case of cloning the repository. This would not be the case if one used branches to change the dataset, however. The files themselves would not need to take up space on your own computer, as you could have them uploaded to a private/public repository on GitHub/GitLab. – Oliver Aug 21 '20 at 10:34

3 Answers


My first recommendation is to use a single drake plan to unite the stages of the overall project that need to share data. drake is designed to handle a lot of moving parts this way, and a single plan makes drake's decisions about what to rerun downstream more seamless. But if you really do need different plans in different sub-projects that share data, you can track each shared dataset as a file_out() file in one plan and track it with file_in() in another plan.

library(drake)
library(readr)

# Upstream plan: write the shared dataset to a file tracked as an output.
upstream_plan <- drake_plan(
  export_file = write_csv(dataset, file_out("exported_data/dataset.csv"))
)
# Downstream plan: read the same file, tracked as an input dependency.
downstream_plan <- drake_plan(
  dataset = read_csv(file_in("../upstream_project/exported_data/dataset.csv"))
)
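
A possible workflow for these two plans, assuming each one is built from its own project root:

# From the upstream project's root directory:
make(upstream_plan)    # builds the dataset and writes exported_data/dataset.csv

# From the downstream project's root directory:
make(downstream_plan)  # reruns its dataset target whenever the exported file changes
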
landau
  • I'd rather do the single `drake` plan than use `file_in()` and `file_out()`, because I'd want to also export character lists and vectors (e.g., of experimental materials). I think I have two concerns: 1. Each project has quite a few components and sub-experiments, so it seems to make sense to keep them separate, at the very least for version control. 2. A single `drake` plan either ends up being quite long (and therefore hard to navigate), or, if I use a lot of nested functions to keep it tidy, I might have an object within the nesting that I actually want as a target at the top level. – shir Aug 20 '20 at 22:48
  • Would you elaborate on what you find natural about the way you’re currently dividing the work into sub-projects? I feel like there should be an easier, simpler mental model that fits your goals equally well. – landau Aug 21 '20 at 03:31
  • I also hope that exists! I initially had all my thesis work in one project/`git` repo. I don't always work on all the experiments at once (they each have different methods, analyses, etc.), so mainly two things annoying me: 1. having to navigate between sub-folders (e.g., thesis/experiment_a/sub_experiment1/data...), and 2. being overly specific in my `git` commits (e.g., to differentiate that I'm talking about one experiment's data cleaning vs. another). I moved to separate projects because of the above, but also just because it seemed as if that's what RStudio projects are for. – shir Aug 21 '20 at 04:23
  • But also maybe I'm putting too high of a premium on readability, and it's fine to have one really long `drake` plan, if it achieves all the other goals. – shir Aug 21 '20 at 04:25
  • You can keep your sub-projects - say, `~/thesis/expA/` and `~/thesis/expB/` - and run a monolithic plan from the top level (call `make()` or `r_make()` from `~/thesis/`). The cache will live at `~/thesis/.drake/`. To make this approach easier, you can create sub-plans for each project (maybe in `~/thesis/expA/R/plan.R` and `~/thesis/expB/R/plan.R`) and call `bind_plans()` to put all the plans together. – landau Aug 21 '20 at 16:14
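
A minimal sketch of the `bind_plans()` layout landau describes in the comment above (the target and function names are hypothetical):

library(drake)

# ~/thesis/expA/R/plan.R
plan_a <- drake_plan(data_a = clean_data_a(file_in("expA/raw_a.csv")))

# ~/thesis/expB/R/plan.R
plan_b <- drake_plan(data_b = clean_data_b(file_in("expB/raw_b.csv")))

# Run from ~/thesis/ so the cache lives at ~/thesis/.drake/:
plan <- bind_plans(plan_a, plan_b)
make(plan)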

You fundamentally misunderstood Miles McBain’s critique. He isn’t saying that you shouldn’t write reusable code, or that you shouldn’t use packages; he’s saying that you shouldn’t use packages for *everything*. Reusable code (i.e., code that you want to reuse) absolutely belongs in packages (or, better, modules), which can then be used in multiple projects.

That being said, first off, pay attention to Will Landau’s advice.

Secondly, you can make your RStudio projects configurable so that they load data based on paths given in a configuration file. Once that’s accomplished, there’s nothing wrong with hard-coding paths to data in other projects inside that config file.
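
For illustration, a minimal sketch of such a configuration using a DCF file read with read.dcf() (the field name and path are hypothetical):

# config.dcf (project-local, one field per line):
#   data_path: ~/projects/upstream_project/exported_data/dataset.csv

cfg <- read.dcf("config.dcf")                    # returns a character matrix
dataset <- readr::read_csv(cfg[1, "data_path"])  # load data from the configured path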

Konrad Rudolph
  • I understand those points about packages; I tried this previously. The reasons I saw to avoid using packages were the specific components Miles McBain critiques (i.e., metadata, artefacts, noise), but I get that this was unclear, so I'll edit the question. Can you explain what you mean by "make your RStudio projects configurable", or provide a link for me to look into this? I don't think I can find what you're talking about by searching for it myself. – shir Aug 20 '20 at 22:55
  • @shir Literally, have a project local config file (e.g. DCF or JSON) that contains the paths, and load the data based on the values inside that config file. – Konrad Rudolph Aug 21 '20 at 11:52
  • I see that you added a reference to your `{box}` package. Is it possible to access modules non-locally? In other words, linking between modules is only possible within the same machine, right? – shir May 12 '21 at 00:07
  • @shir They need to be accessible via the filesystem, yes. I’ve considered adding functionality that could e.g. use code from remote GitHub repositories, but (for now) I’ve decided against it because I believe that usage and installation of dependencies should be distinct steps that shouldn’t be mixed. I *might* revisit this decision in the future, but the current plan is instead to add installation functionality (similar to ‘renv’) to ‘box’. But this won’t be soon. – Konrad Rudolph May 12 '21 at 08:43
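
For reference, a minimal sketch of cross-project module use with box (the module path and function name are hypothetical; the consuming project reaches the module through a relative filesystem path):

# In ../upstream_project/cleaning.r (a box module):
#' @export
clean <- function(raw) {
  # ... cleaning steps ...
  raw
}

# In the consuming project's script:
box::use(../upstream_project/cleaning)
cleaned <- cleaning$clean(raw_data)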

I am in a similar situation. I have many projects spawned from one raw dataset. Previously, when the project was young and small, I had it all in one version-controlled project. This got out of hand as more sub-projects were spawned and my git history got cluttered from working on projects in parallel. This could be due to my lack of skill with git. My folder structure looked something like this:

project/.git  
project/main/  
project/sub-project_1/  
project/sub-project_2/  
project/sub-project_n/

I contemplated having each project in its own git branch, but then I could not access them simultaneously. If I had to change something in the main dataset (e.g., parts I might not have cleaned), then project 1 could become outdated and nonfunctional. Once I had finished project 1, I would have liked it to be isolated and contained for reproducibility. This is easier to achieve if the projects are separated. I don't think a drake/targets plan would solve this?

I also looked briefly into having the projects as git submodules but it seemed to add too much complexity. Again, my git ignorance might shine through here.

My current solution is to have the main data as an R package, and each sub-project as a separate git-versioned folder (they are actually packages as well, but this is not necessary). This way I can load a specific version of the data (using renv to pin package versions).

My folder structure now looks something like this:

main/.git  
sub-project_1/.git  
sub-project_2/.git  
sub-project_n/.git

And inside each sub-project, I call library(main) to load the cleaned data. Within each sub-project, a drake/targets plan could be used.
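
A sketch of what that might look like inside a sub-project, assuming the main package lives in a GitHub repository with tagged releases and exports a dataset called cleaned_data (all names hypothetical):

# Pin a specific tagged version of the data package; renv records it in renv.lock:
renv::install("user/main@v1.2.0")
renv::snapshot()

# In the sub-project's scripts:
library(main)
data(cleaned_data)  # load the cleaned dataset exported by the package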

vrognas
  • This is the kind of structure I'm leaning towards, except that I'm still looking for a way not to rely on putting the data in a package in order to avoid the unnecessary package fluff (e.g., I'm experimenting with using OSF at the moment). One thing that I think is useful with `{targets}` (as opposed to `{drake}`) is that you can [declare package dependencies](https://books.ropensci.org/targets/practices.html#packages-based-invalidation). – shir May 12 '21 at 00:15