I regularly collaborate on large data analysis projects using git and statistical software such as R. Because the datasets are very large and may change upon re-download, we do not keep them in the repository. While we like to design the final versions of our scripts to take paths to the raw datasets as command-line arguments, during development it's easier to test and debug by reading the files directly into the R environment. As a result, we end up with lines such as
something = read.raw.file("path/to/file/on/my/machine")
#something = read.raw.file("path/to/file/on/collaborators/machine")
#something = read.raw.file("path/to/file/on/other/collaborators/machine")
cluttering up the code.
There must be a better way. I've tried adding a file that each script reads before running, such as
proj-config.local
path.to.raw.file.1 = "/path/to/file/on/my/machine"
and adding it to .gitignore, but this is a "heavyweight" workaround: it takes time to set up, it's not obvious to collaborators that one is doing this or that they should too, and since the file is ignored they might name or locate it differently, so the shared line of code that reads it ends up wrong, etc.
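For concreteness, here is a minimal sketch of what each script currently has to do under this workaround (read.raw.file stands in for whatever reader we actually use, and path.to.raw.file.1 is just the variable name the local config is assumed to define):

    # Read the gitignored local config if present; otherwise expect a
    # path on the command line (the intended final behaviour).
    if (file.exists("proj-config.local")) {
      source("proj-config.local")  # defines path.to.raw.file.1, etc.
    } else {
      args <- commandArgs(trailingOnly = TRUE)
      path.to.raw.file.1 <- args[1]
    }
    something <- read.raw.file(path.to.raw.file.1)

This boilerplate has to be repeated (and kept consistent) at the top of every script, which is part of why the approach feels heavyweight.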
Is there a better way to manage local outside-repo paths/references?
PS I didn't notice anything addressing this issue in any of these related questions:
- Workflow for statistical analysis and report writing
- project organization with R
- What best practices do you use for programming in R?
- How do you combine "Revision Control" with "Workflow" for R?
- How does software development compare with statistical programming/analysis?
- Essential skills of a Data Scientist
- Ensuring reproducibility in an R environment
- R and version control for the solo data analyst