
I am currently tasked with moving my shop from ClearCase to the wonderful world of Git. In the process of doing this, I am finding all sorts of hokey junk that my shop has kept in version control, which has bloated the size of the repository.

The main culprit that I have found is that we have kept router IOS configuration images in our ClearCase repository. These are gigantic binary images, hundreds of megabytes in size.

I have done some reading on Git, and the recommendation is that a Git repo should contain only source files. Large binary files should not be kept in version control.

So, my question is this: What is the "standard" way of handling files like router configuration images (or something similar)? These are large binary files that our shop does not maintain and cannot regenerate ourselves, but that we need for the deployed baseline of our production system.

T Lytle
  • Typically you would store them externally (outside the repository) and perhaps include a reference, in the repository, to the external storage entity. A URL might suffice, or you could store something that's essentially a fancified URL, which is what Git-LFS does. – torek Feb 02 '18 at 21:31

3 Answers


What is the "standard" way of handling such files like router configuration images (or something similar)?

Having done ClearCase-to-Git migrations many times before, I usually put those kinds of artifacts in an artifact repository, either Nexus or Artifactory.

That way, those binaries can be referenced by a project setting and downloaded on demand.
The project setting is part of a "declarative approach" which fits Git well: a simple text file that is processed by your build tool and updates the workspace accordingly.
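
As a minimal sketch of what such a declarative file might look like: a small manifest checked into Git, which a build step resolves to download URLs. The repository URL, artifact name, version, and path layout below are all invented for illustration; a real Nexus or Artifactory layout will differ.

```python
import json

# Hypothetical manifest a build tool might process; the artifact name,
# version, and repository URL are illustrative, not real.
manifest_text = """
{
  "repository": "https://nexus.example.com/repository/router-images",
  "artifacts": [
    {"name": "c2900-universalk9-mz.SPA.bin", "version": "15.7.3"}
  ]
}
"""

manifest = json.loads(manifest_text)

# Resolve each artifact entry to a download URL under the repository root.
urls = [
    f'{manifest["repository"]}/{a["name"]}/{a["version"]}/{a["name"]}'
    for a in manifest["artifacts"]
]
print(urls[0])
```

Only the small text manifest lives in Git; bumping an image version is then an ordinary, diffable one-line change.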

VonC

There's no strict "right answer" here, but there are a few guidelines you can follow.

General rules:

  • All source code gets checked in.
  • Files that can be generated from source files do not go in version control. These are things like executables or other build output.
  • Images that are part of the application normally get checked in. Because they are binary, you can't do a text diff, but Git will handle them just fine, storing new versions as they are added.
  • 3rd party libraries are a gray area. Most people would check in a build file like package.json, Gemfile, or pom.xml, but exclude the library source/binary. Some people like the additional safety/security aspects of checking in 3rd party code and would, for example, check in both package.json and the node_modules directory.
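
In the Node example, the gray area comes down to a single entry in .gitignore: commit package.json (and a lockfile) but ignore the library directory, or omit this entry and commit node_modules too. The usual exclusion is just:

```
node_modules/
```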

It is not explicitly wrong to check in your large configuration images, but it could affect the performance of your repository. As mentioned by torek, Git LFS might be a good solution here.
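
Git LFS is driven by a .gitattributes file: running `git lfs track "*.bin"` writes a line like the following, after which Git stores a small pointer in the repository and keeps the real content in LFS storage. The `*.bin` pattern is only an example; you would match your actual image filenames.

```
*.bin filter=lfs diff=lfs merge=lfs -text
```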

Another solution is to simply put your large configuration images somewhere all your developers can access them (an HTTP or FTP server, etc.). Then check in a small script (perhaps part of the build script) that fetches the correct image (if not already cached) and places it where it's needed on the local filesystem. In this case, all you need to check in to Git is the script.
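
Such a script can be quite small. Here is a sketch in Python; the server URL and cache directory are hypothetical, and the point is the cache-then-link pattern, not the specific names:

```python
import os
import urllib.request

# Hypothetical internal server and cache location; adjust for your environment.
BASE_URL = "http://images.example.com/baseline"
CACHE_DIR = os.path.expanduser("~/.image-cache")

def fetch_image(name, dest_dir="."):
    """Download `name` into a local cache (if absent), then place it in dest_dir."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cached = os.path.join(CACHE_DIR, name)
    if not os.path.exists(cached):
        # Only hit the network when the image is not already cached locally.
        urllib.request.urlretrieve(f"{BASE_URL}/{name}", cached)
    dest = os.path.join(dest_dir, name)
    if not os.path.exists(dest):
        # A hard link avoids a second multi-hundred-megabyte copy on disk.
        os.link(cached, dest)
    return dest
```

A build step would call `fetch_image("c2900-universalk9-mz.SPA.bin", "images/")` (name invented) before packaging the deployment baseline.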

mkasberg

Version control should mainly keep "primary objects". Primary objects are files that are not derived automatically from other files. If some tool generates B from A, then only A should be in version control, at least ideally. Some circumstances can justify B being in version control also. For instance, the program has to build in environments where the A to B tool doesn't exist.

An example occurs in compiler bootstrapping. Suppose that the project implements a language called L, whose compiler outputs C. Most of L is implemented in L itself! Oops, many of the target users do not have an L compiler to build the L sources; they only have a C compiler. Those users cannot pull the repo and build the L compiler unless the repository includes the C versions of the L source files (or they otherwise obtain them somehow).

Large, binary files can be primary objects. For instance, the image data for a video game and such. There is definitely a need for version control that handles large, binary files.

One way to handle large binaries in a version control system that does not work well with such files is to keep the binaries on some server (under versioned paths), and store just those paths in the repo (in some parametrized way, so that if the paths have to change, users of the repo just change some environment variable).
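
The parametrization can be as simple as joining an environment variable with the versioned relative path stored in the repo. The variable name, default root, and paths below are invented for illustration:

```python
import os

# Binaries live on a server under versioned paths; the repo stores only the
# relative path, and the store root comes from an environment variable so it
# can change without touching the repo. Names here are hypothetical.
def binary_path(relative_path):
    root = os.environ.get("BINARY_STORE_ROOT", "/mnt/binary-store")
    return os.path.join(root, relative_path)

os.environ["BINARY_STORE_ROOT"] = "/srv/artifacts"
print(binary_path("firmware/v2.1/boot.img"))
```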

Sometimes binaries are the derived objects of some other repo. For instance, you have some embedded system project that has all sorts of software in the repo. One of the pieces is some piece of firmware that is uploaded to some chip when the system boots. That comes from some other repo; you don't build it. So just the binary images of that firmware are checked in. The firmware is a derived object from some primaries, but you either don't have them, or don't want to pull in those primaries because of dependencies (like the whole toolchain needed to build them and such).

Kaz