29

We have a Git repository that contains SVM AI input data and results. Every time we run a new model, we create a new root folder for that model so that we can organize our results over time:

/run1.0
  /data
    ... 100 MB of data
  /classification.csv
  /results.csv
  ...
/run2.0
  /data
    ... 200 MB of data (including run1.0/data)
  /classification.csv
  /results.csv
  ...

As we build new models we may pull in data (large .wav files) from a previous run. This means that run2.0/data may contain all the files from run1.0/data plus any additional data we have collected since.

The repo will easily exceed a gigabyte if we keep this up.

Does Git have a way to recognize duplicate binary files and store them only once (e.g. like a symlink)? If not, we will rework how the data is stored.

JoshuaJ

3 Answers

33

I am probably not going to explain this quite right, but my understanding is that every commit stores only a tree structure representing the file structure of your project, with pointers to the actual files, which are stored in an objects subfolder. Git uses the SHA-1 hash of a file's contents to create the object's file name and subfolder, so for example if a file's contents produced the following hash:

0b064b56112cc80495ba59e2ef63ffc9e9ef0c77

It would be stored as:

.git/objects/0b/064b56112cc80495ba59e2ef63ffc9e9ef0c77

The first two characters are used as a directory name and the rest as the file name.

The result is that even if you have multiple files with the same contents, under different names, in different locations, or in different commits, only one copy is ever saved, with several pointers to it from each commit's tree.
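
A quick way to see this for yourself (a sketch using standard Git plumbing in a fresh repository; the file names are made up):

    # two files with identical contents
    echo "same bytes" > a.wav
    cp a.wav b.wav

    # hash-object prints the SHA-1 Git stores the blob under;
    # both commands print the same hash
    git hash-object a.wav
    git hash-object b.wav

    # after adding both, only one loose object exists for that content
    git add a.wav b.wav
    find .git/objects -type f

Both index entries point at the same blob, so the second file costs only a tree entry, not another copy of the data.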

Sergei Tachenov
Dave Sexton
  • Interesting... this would make a lot of sense and I was wondering if this is what was happening. I'll have to do some digging to see if this is actually the case (when I get some free time). – JoshuaJ Apr 29 '15 at 16:54
  • 2
    pastebin.com/p0KpqBPX for those of you too lazy to experiment :) Same object, only slightly more space required than 1 file in .git/objects – opatut Apr 29 '15 at 18:09
  • Actually this makes complete sense now. The way git knows a file moved is by its SHA, so it would make sense that Git by default would easily be able to recognize the same file in multiple locations in the repo tree. – JoshuaJ Apr 29 '15 at 19:23
9

By default/itself: Yes (my original answer said no; see the edit below).

Git works on the basis that it creates snapshots of files, not incremental differences as some other VCSs do.

EDIT

As mentioned by Dave and opatut, my understanding of how Git stores files was incorrect, and I apologize for the confusion caused. After doing more research, I found that Git does store duplicated files as pointers to a single object. Quoting VonC in the accepted answer of this question:

... several files with the same content are stored only once.

Please also note that, as mentioned in that answer, Git conceptually stores full snapshots.

Referencing the git-scm documentation:

Git thinks of its data more like a set of snapshots of a miniature filesystem. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again, just a link to the previous identical file it has already stored. Git thinks about its data more like a stream of snapshots.

However, at the storage level deltas are still used: when packing objects, Git tries to generate the smallest possible deltas by heuristically selecting similar blobs, and there are options that trade speed for better compression, which reduces the size of the repository.
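
For example, the following commands can be used to inspect and repack a repository (a sketch; the --window and --depth values are illustrative, not tuned recommendations):

    # show how many loose and packed objects exist and their size
    git count-objects -v

    # repack everything, letting Git search harder for good deltas
    git repack -a -d --window=250 --depth=250

    # or let git gc pick settings and prune unreachable objects
    git gc --aggressive --prune=now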

Also, as opatut's pastebin of test output linked in the comments shows, duplicate objects are stored only once. That means Git will recognize duplicate binary files and store them only once, which is what the original question asked for. The following are other options for handling duplicate files.

Other alternative: Symlinks

You could set up symlinks to the previous files, so that when you work on them they all point to the same large file. Note, however, that Git does not track the files that symlinks point to; it stores only the symlink itself. This satisfies your need to reduce space at the sacrifice of portability: if you move to another dev machine, you'll have to make sure the files the symlinks point to are actually there, which might not be what you want. See this very good SO Q&A on what Git does with symlinks.
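
A minimal sketch of what Git actually records for a symlink (the paths are hypothetical):

    # link the new run's file to the old one instead of copying it
    ln -s ../../run1.0/data/sample.wav run2.0/data/sample.wav
    git add run2.0/data/sample.wav

    # the index holds a mode-120000 entry whose blob is just the target path
    git ls-files -s run2.0/data/sample.wav
    # 120000 <sha-1 of the target path> 0  run2.0/data/sample.wav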

Another alternative: tools!

I've found multiple tools that might help accomplish what you need in managing binary files.

You can try git-annex, which basically keeps binary file content outside of regular Git storage and checks files into the work tree as symlinks, so in a way it is a more automatic way of handling symbolic links. Here's their project site.
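
A rough sketch of the basic git-annex workflow (untested here; consult the git-annex documentation for specifics):

    git annex init                     # set up the annex in an existing repo
    git annex add run2.0/data/*.wav    # move content into the annex, commit symlinks
    git commit -m "add run2.0 data"

    # on another clone, fetch the actual file content only when needed
    git annex get run1.0/data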

Or use the built-in git submodules with a separate repo to achieve what you want, fetching the large binary files only when you need them.
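
For the submodule route, something like this (the repository URL is hypothetical):

    # keep the large .wav files in their own repository
    git submodule add https://example.com/svm-data.git data
    git commit -m "track data as a submodule"

    # on a fresh clone, pull in the data only when it is needed
    git submodule update --init data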

Admittedly I have not attempted these options, so here is a reference with more explanation of them. Reference: this SO question

matrixanomaly
  • 1
    What a fantastic answer. I was beginning to mentally explore the idea of symlinks but was not sure what was available. I'll be looking into that now. Thank you. – JoshuaJ Apr 29 '15 at 15:54
  • @JoshJ no problem, glad I could help, and I'm humbled by your compliment. good luck implementing it! – matrixanomaly Apr 29 '15 at 16:12
  • 1
    Your answer is misleading and a bit confusing, IMO. In fact, git considers two files with the same SHA to be identical, theirs paths don't matter. So for OP's question, it's fine, git will not store the same file multiple times. See Dave Sexton's answer for why. – opatut Apr 29 '15 at 17:49
  • @opatut if it's the case as to how Dave explains it I apologize. That was based on my understanding from reading the git docs and if copies of the same file are stored as 1 file based on their hashes then I will update my answer accordingly. – matrixanomaly Apr 29 '15 at 17:57
  • however, an interesting question is why would tools like git-annex exist then? – matrixanomaly Apr 29 '15 at 17:57
  • http://pastebin.com/p0KpqBPX for those of you too lazy to experiment :) Same object, only slightly more space required than 1 file in .git/objects – opatut Apr 29 '15 at 17:59
  • 5
    @opatut I have fixed my answer and added your pastebin into my answer, with comments. Sorry for the confusion, OP and everyone else. Dave's answer is more spot on and his answer should be accepted – matrixanomaly Apr 29 '15 at 18:15
  • 1
    My downvote shall be converted into an upvote then :) Still, good research on the other options. – opatut Apr 29 '15 at 18:46
  • I do appreciate your answer and updates matrixanomaly, but will officially switch to Dave's simply because it is more direct to the question. – JoshuaJ Apr 29 '15 at 19:40
  • 'several files with the same content are stored only once'. How will git identify that two files have the same content? By using the SHA-1 checksum? – kumarp Feb 16 '22 at 10:31
0

Even if Git stores the files only once, which rescues your current approach, you are using a VCS the wrong way and losing all the advantages of a VCS: you cannot see which changes were made between two versions.

You'd be better off with a single 'run' directory for your files, making a commit for each new version (with tags, if you want to find your important 'runs' more easily).

That way you could see what was done between versions and improve your work.
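
Concretely, that workflow might look like this (a sketch; the tag names are just examples):

    # one 'run' directory; one commit and tag per model run
    git add run
    git commit -m "model run 1.0"
    git tag run1.0

    # ...update the data and results in place...
    git add run
    git commit -m "model run 2.0"
    git tag run2.0

    # compare what changed between the two runs
    git diff run1.0 run2.0 -- run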

No need to keep everything in sunflowers!

What you are trying to do is a bad thing!

Philippe
  • Yeah, unfortunately these are not version numbers. These are completely different model runs and there may be the need to share information across them and to retrieve all of them in one single checkout without having to jump around revisions. – JoshuaJ Apr 30 '15 at 02:00