Git workflow for a modelling task with data updates

Question

I am looking to kick off a modelling/forecasting task with the aid of git. I want to set up a git architecture to facilitate this, but am having some issues.

Goal: At the end of the modelling task in region/subregion branches (human revisions are needed, and each human revision is a commit), merge down to master so that all of the forecasts are available for review, together with the version of the code and dataset each one was run on. If revisions need to be made later on, a modeller should be able to branch out from the point where the exact forecast was completed and work on it with the correct (possibly older) version of the code.

Issue: The data and code versions can change. Older model runs will likely not be compatible with newer code/data (for example, region 1 may use code version 1 and data version 2, while region 2 uses code version 4 and data version 6), and at the end of the project, every forecast must be reproducible.

My solution: It seems to be against the philosophy of git, but every time there is a dataset or code update, place it in master and append a version number to the file name. Have region/subregion branches and tag every forecast completion commit. Then, when the forecast is completed, merge down to master and add another file that states which versions of the code and data it was run on. If a revision needs to be made, find the completion tag, remodel with the proper version of the code, merge back into the region branch and then down to master. If a model needs to be reproduced, run it with the correct code/data (as listed in the additional file).

Is this the best way to go about using git to track this process, or is there a better/simpler way? Will this process work, or are there unintended issues that may arise because of it?

score 1 · Answer 1 · answered Nov 13 '17 at 20:11

The data and code version can change

That means you have two sets of files, with a strong coupling, but with their own evolution within that coupling.

That is a job for Git submodules: you put the code and the data each in their own separate git repository, and you reference a fixed SHA1 for each in a main parent repo:

parent/
  code/
  data/
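A minimal sketch of that layout, assuming throwaway local repositories in a temporary directory. The names code, data, parent and the README files are placeholders; in a real project the two submodules would point at remote URLs. The protocol.file.allow=always setting is there only because recent Git versions refuse local-path submodules by default.

```shell
#!/bin/sh
set -e

# Identity for the demo commits (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Two independent repositories: one for the code, one for the data.
for repo in code data; do
    git init -q "$repo"
    echo "$repo v1" > "$repo/README"
    git -C "$repo" add README
    git -C "$repo" commit -q -m "$repo v1"
done

# The parent repository references each of them as a submodule.
git init -q parent
cd parent
git symbolic-ref HEAD refs/heads/master   # name the initial branch "master"
git -c protocol.file.allow=always submodule --quiet add "$work/code" code
git -c protocol.file.allow=always submodule --quiet add "$work/data" data
git commit -q -m "Pin code v1 and data v1"

# Each submodule shows up as a gitlink entry (mode 160000) holding one SHA1.
git ls-tree HEAD
```

The last command lists .gitmodules as a regular blob and the two submodules as commit entries, which is the pairing of code and data versions that the parent commit records.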

That way, from the parent repo, you can make a branch in which both code and data will change. When the forecast is completed, what you merge to master (in parent) is the pair of latest SHA1s of code and data.

The benefit of submodules is that you record in the parent repo the exact SHA1 of the data repo which is supposed to be compatible with the code repo.
And you completely avoid any "hack" like renaming files.
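The region-branch workflow from the question could then look roughly like the sketch below. It is an illustration under assumptions, not a prescribed process: the branch name region1, the tag region1-forecast-v1, the file forecast_region1.txt and the local repositories are all hypothetical, and the setup lines only recreate a parent repo with code and data submodules so the script runs on its own.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Setup: code and data repos, plus a parent that pins both as submodules.
work=$(mktemp -d)
cd "$work"
for repo in code data; do
    git init -q "$repo"
    echo "$repo v1" > "$repo/README"
    git -C "$repo" add README
    git -C "$repo" commit -q -m "$repo v1"
done
git init -q parent
cd parent
git symbolic-ref HEAD refs/heads/master
git -c protocol.file.allow=always submodule --quiet add "$work/code" code
git -c protocol.file.allow=always submodule --quiet add "$work/data" data
git commit -q -m "Pin code v1 and data v1"

# Region branch: the forecast is committed and tagged with code v1 / data v1.
git checkout -q -b region1
echo "forecast for region 1" > forecast_region1.txt
git add forecast_region1.txt
git commit -q -m "Region 1 forecast"
git tag region1-forecast-v1

# Later, the data repo moves on to v2 ...
echo "data v2" > "$work/data/README"
git -C "$work/data" commit -q -am "data v2"

# ... and the region branch adopts it by recording the new SHA1.
git -C data pull -q --ff-only
git add data
git commit -q -m "Region 1: move to data v2"

# Merge down to master: master now records code v1 and data v2.
git checkout -q master
git merge -q --no-ff -m "Merge region1 forecast" region1

# Revision of the tagged forecast: branch from the tag, then sync submodules.
git checkout -q -b region1-revision region1-forecast-v1
git -c protocol.file.allow=always submodule --quiet update --init
git -C data log -1 --format=%s   # prints "data v1"
```

Checking out the tag only moves the parent; the git submodule update step is what puts the code and data working trees back at the SHA1s recorded by that commit, so no version file or renamed data file is needed.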

When you say "record in the parent repo the exact SHA1..." does that mean in order to run any old forecast at a later date, that you would need to git reset --soft to that SHA1? — RayVelcoro, Nov 13 '17 at 20:32
@RayVelcoro you reset (or check out) the parent to the SHA1 you want, and a git submodule update will in turn reset the submodules to their respective SHA1s, as recorded by the parent commit. Those SHA1s are gitlinks: see https://stackoverflow.com/a/2227598/6309 and https://stackoverflow.com/a/17442045/6309 — VonC, Nov 13 '17 at 20:35
