I regularly collaborate on large data analysis projects using git and statistical software such as R. Because the datasets are very large and may change upon re-download, we do not keep these in the repository. While we like to design the final versions of the scripts we develop to use command line arguments to read paths to the raw datasets, it's easier to test and debug by directly reading the files into the R environment. As we develop, therefore, we end up with lines such as

something = read.raw.file("path/to/file/on/my/machine")
#something = read.raw.file("path/to/file/on/collaborators/machine")
#something = read.raw.file("path/to/file/on/other/collaborators/machine")

cluttering up the code.

There must be a better way. I've tried adding a file that each script reads before running, such as

proj-config.local
    path.to.raw.file.1 = "/path/to/file/on/my/machine"

and adding it to .gitignore, but this is a "heavyweight" workaround given how much time it takes. It's also not obvious to collaborators that one is doing this, or that they should do the same. And because the file is ignored, they might name or locate it differently, so the shared line of code that reads the file ends up wrong.

Is there a better way to manage local outside-repo paths/references?

PS: I didn't notice anything addressing this issue in any of these related questions:

  1. Workflow for statistical analysis and report writing
  2. project organization with R
  3. What best practices do you use for programming in R?
  4. How do you combine "Revision Control" with "Workflow" for R?
  5. How does software development compare with statistical programming/analysis?
  6. Essential skills of a Data Scientist
  7. Ensuring reproducibility in an R environment
  8. R and version control for the solo data analyst
Philip
  • What about having everybody keep the repository on their local machine in the same location relative to the files of interest and using relative paths instead of full paths? – Dason Jun 24 '14 at 18:01
  • Also, it's trivial to add a "working directory root" command-line option or environment variable *once*, which can override the default root directory for users with specific needs. – ChrisGPT was on strike Jun 24 '14 at 18:19
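
The environment-variable idea in the last comment might look like the minimal sketch below. The variable name PROJ_DATA_ROOT, the default directory "data", and the helper data_path are all hypothetical; only Sys.getenv and file.path are base R.

```r
# Hypothetical helper: resolve a data file against a per-machine root.
# PROJ_DATA_ROOT is an assumed variable name, not an R or git convention.
data_path <- function(..., default_root = "data") {
    # Fall back to default_root when the variable is not set.
    root <- Sys.getenv("PROJ_DATA_ROOT", unset = default_root)
    file.path(root, ...)
}

# Each collaborator sets PROJ_DATA_ROOT once (for example in ~/.Renviron);
# the shared script then needs only one machine-independent line:
# something <- read.raw.file(data_path("raw_file_1.csv"))
```

With this layout nothing machine-specific is committed, and collaborators who keep the data in the default location do not need to set anything.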

2 Answers

I've run into something similar when working on the same repo from two different platforms, while storing data files outside the repo on each machine. One thing you can do is get everyone to keep the files in a specific place relative to the project's working directory.

With that knowledge, you can construct the path at the start of each session. For example:

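# Keep only the part of the working directory that precedes the project folder.
path <- strsplit(getwd(), "project_directory", fixed = TRUE)[[1]][1]
path <- file.path(path, "data_directory", "file")
something <- read.raw.file(path)

strsplit returns a list with one element per input string, so [[1]] selects the pieces of the working directory and [1] keeps the piece before "project_directory", i.e. the path to the parent of your project directory. Without the [1], running from a subdirectory of the project would leave path with more than one element. If your data is in data_directory inside that parent, file.path builds the full path to the file using the platform's separator (its fsep argument defaults to .Platform$file.sep).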
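
A variation on the same idea, as a rough sketch only: instead of splitting on a substring (which would also match a folder such as "my_project_directory_old"), walk up from the working directory until a folder with exactly the project's name is found. The function name sibling_data_path and its arguments are hypothetical, and the layout (data stored next to the project folder) is the same assumption as above.

```r
# Walk up from `start` until a directory named `project_name` is found,
# then build a path to `...` inside that directory's parent.
sibling_data_path <- function(project_name, ..., start = getwd()) {
    dir <- normalizePath(start, winslash = "/", mustWork = FALSE)
    while (basename(dir) != project_name) {
        parent <- dirname(dir)
        if (parent == dir) {
            # Reached the filesystem root without finding the project folder.
            stop("No directory named '", project_name, "' above ", start,
                 call. = FALSE)
        }
        dir <- parent
    }
    file.path(dirname(dir), ...)
}

# path <- sibling_data_path("project_directory", "data_directory", "file")
# something <- read.raw.file(path)
```

This still depends on the project folder's name, so it does not remove the sensitivity to renaming; it only makes the match exact and fails with a clear error.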

Ajar
  • Hmm, interesting idea. I agree this would probably work in some instances but I'm generally opposed to a solution that is so sensitive to moving/in particular renaming a folder. Ideally the script (and whole repo) shouldn't really care where it's located, don't you think? Granted that I'm somewhat unfairly allowing for it to be very concerned with the location of another file. – Philip Jun 24 '14 at 20:20
  • I actually think there's value in each developer having a consistent directory structure around the repo, but if that doesn't work for you, you'll need to do filesystem searches as @Grisby_2133 suggests below. – Ajar Jun 25 '14 at 14:00
  • Just out of curiosity, aside from solving this particular issue, are there other things you find valuable about maintaining that requirement? It clearly helps with this problem, but seems it imposes unnecessary constraints on collaborators' environments otherwise. (E.g. some may like to store all their raw data in one folder for easy backup to an external machine; others may group projects by category and store data with each project, etc.) – Philip Jun 27 '14 at 14:22
  • It saves you from having to account for multiple structures in the future if you write more code that depends on it. For example, it allows you to easily automate downloading/updating of some or all of the data files in the future, since all of your existing code already knows where they are stored. – Ajar Jun 27 '14 at 19:25

A solution I've been using is to build in the concept of a search path that can be used to locate files. In one particular application, I've built in the ability to override the search path with an environment variable, similar to the commonly used PATH variable.

I wrote a function, findFileInPath (below), that searches the supplied path for a file and returns every match it finds. The path can be given as a character vector, and each element may itself hold several directories separated by a separator character, the way an OS search path typically does.

You could use it like this (as an example only):

DataSearchPath = c(
    "path/to/file/on/my/machine",
    "path/to/file/on/collaborators/machine",
    "path/to/file/on/other/collaborators/machine",
    Sys.getenv('DATASEARCHPATH')
)

DataFilename = "data_file.csv"
DataPathname = findFileInPath(DataFilename, path=DataSearchPath)[1] # Take the first one

if (is.na(DataPathname)) {
    stop(paste("Cannot find data file", DataFilename), call.=FALSE)
}

...

I use something like that to locate files to source, to locate configuration files, data sets, etc. I have multiple different paths, some of them exposed in the environment or various configuration files, others are just internal. It works pretty well.

In the example above, the DATASEARCHPATH environment variable can be set (outside of R) to a colon-separated series of paths to search.
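
For illustration, the variable can also be set from within R while testing. The two directories below are made up, and in practice the value would more likely be set in the shell or in ~/.Renviron.

```r
# Hypothetical directories, set here only to show the format of the value.
Sys.setenv(DATASEARCHPATH = "/data/shared:/mnt/raw")

# The colon-separated value splits into the individual directories to search.
search_dirs <- unlist(strsplit(Sys.getenv("DATASEARCHPATH"), ":", fixed = TRUE))
```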

My implementation of findFileInPath defaults to searching the current directory and then the directories in the system's PATH environment variable, split on the colon character. (This probably won't be applicable to Windows. I only use this on Mac and Linux.)

#' findFileInPath: Locates files by searching the supplied paths
#'
#' @param filename character: the name of the file to search for
#'
#' @param path character: the path to search, either a vector, or optionally
#'   separated by \code{sep}.
#'
#' @param sep character: the separator character used to split \code{path}
#'   into multiple components.
#'
findFileInPath = function(filename, path=c('.',Sys.getenv('PATH')), sep=':') {

    # List all potential files, and return only those which exist.
    files = data.frame(name=file.path(unlist(strsplit(path, sep)), filename),
                       stringsAsFactors=FALSE)
    files$exist = file.exists(files$name)
    files[files$exist==TRUE,1]
}
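
If Windows support is needed, one possible variant (a sketch, not part of the original implementation) takes the default separator from .Platform$path.sep, which is ":" on Unix-alikes and ";" on Windows. The name findFileInPathPortable is hypothetical.

```r
# Variant of findFileInPath whose default separator follows the platform.
findFileInPathPortable <- function(filename,
                                   path = c(".", Sys.getenv("PATH")),
                                   sep = .Platform$path.sep) {
    # fixed = TRUE treats `sep` as a literal string rather than a regex.
    dirs <- unlist(strsplit(path, sep, fixed = TRUE))
    candidates <- file.path(dirs, filename)
    # Return only the candidates that exist, in search-path order.
    candidates[file.exists(candidates)]
}
```

As with the original, indexing the result with [1] gives NA when nothing is found, so the is.na check in the usage example still applies.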
Grisby_2133