7

I have been using git for a while now and am gradually understanding how it works. One main point I've understood is that it creates a snapshot every time a new commit is made. Of course, the snapshot will contain only the changed files and pointers to the unchanged files.

According to Pro Git, § 1.3 "Getting Started - Git Basics":

Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again—just a link to the previous identical file it has already stored.

But let's say I have a really big file, e.g. a 2 GB text file, and I change that file 10 times and hence make 10 commits in a day. Does that mean I now have ten 2 GB files on my computer? That seems really inefficient to me, so I suspect this might not be the case.

Could someone clarify what would happen in this scenario?

RandomQuestion
  • Git tracks changes, not files – Tim May 02 '14 at 06:36
  • It definitely does not store 10 copies of the file. – Ryan May 02 '14 at 06:36
  • @TimCastelijns, According to http://git-scm.com/book/en/Getting-Started-Git-Basics `Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again—just a link to the previous identical file it has already stored.` So it doesn't mean what I think it means? – RandomQuestion May 02 '14 at 06:38
  • But @TimCastelijns, the question is "how does git track a small change in a big file?" – Andreas Wederbrand May 02 '14 at 06:38
  • Possible duplicate of [How does git store files?](http://stackoverflow.com/questions/8198105/how-does-git-store-files) –  May 02 '14 at 06:42
  • @AndreasWederbrand no it's not. Anyway I wasn't answering the question, just making a comment – Tim May 02 '14 at 06:42
  • See [this answer](http://stackoverflow.com/a/8198276/456814), particularly the last part. –  May 02 '14 at 06:55
  • A correction to @TimCastelijns comment: git tracks *content*, but uses deltas (if it sees fit) for internal storage. Go read the "duplicate" link: [How does git store files?](http://stackoverflow.com/questions/8198105/how-does-git-store-files) – LeGEC May 02 '14 at 08:47
  • I've used git to track daily changes on a database: a daily dump of each table in its own `table.sql` file. (Warning: this is not an intended use of git, and it will work poorly if you have a very active db.) I regularly run the `git gc` command (I think this implies a `repack`), and the repo size is roughly the size of the compressed dump (it's clearly not [nbDays] times the compressed size). – LeGEC May 02 '14 at 08:56
  • @RPM It does, but it compresses them when the objects are packed, which saves space. – Noufal Ibrahim May 02 '14 at 09:28

2 Answers

9

The short answer is "yes, you now have ten 2 GB files". However:

  1. "Files" under a commit are stored as "blob" objects, and all git objects (blobs, trees, commits, and annotated-tags) are kept internally in zlib deflated format. So a 2 GB text file is actually a considerably smaller object.

  2. "Loose" objects (all of them, again) are eventually "packed". You can do this manually with git pack-objects and git repack but generally you just let git do it on its own as part of standard "garbage collection" (git gc). Inside a pack, objects are delta-compressed against similar objects. The end result with most files is pretty impressive.

All that said, git eventually fails badly if you feed it a lot of large incompressible binary files (I had to deal with this at a previous workplace, where we stuffed 2GB of .tgz files into repos). They don't deflate, they generally don't delta-compress, and eventually even the pack format falls over. There are at least two solutions in relatively widespread use: git-annex and git-bup. See Managing large binary files with git.
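For illustration, this is roughly what the git-annex route looks like (a sketch following git-annex's basic walkthrough; the repository and file names are placeholders):

    git init media && cd media
    git annex init "my laptop"       # enable git-annex in this repository
    git annex add big-video.tgz      # content moves under .git/annex/objects;
                                     # the working tree keeps a symlink to it
    git commit -m "add big-video.tgz via git-annex"
    # Content is later transferred explicitly between repositories with
    # commands like 'git annex get' and 'git annex copy', so ordinary
    # commits, packs, and gc never touch the large payload itself.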

torek
3

I just tested it.

First I created a large file (24 MB of text) and committed it. My .git directory is now 216 KB. Git uses compression, and my text file was easy to compress.

I then made a small change to the first line of the file and committed that. My .git directory is now 356 KB, and .git/objects contains two objects, each 132 KB:

132K    ./.git/objects/8d
132K    ./.git/objects/f7

After running git gc, those two objects are compressed into a pack file of only 68 KB.
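For anyone who wants to repeat the experiment, this is roughly the sequence (a sketch assuming bash, GNU coreutils, and GNU sed; your sizes will differ):

    git init big-file-test && cd big-file-test
    yes "an easily compressed line of text" | head -c 24M > large.txt
    git add large.txt && git commit -m "add 24 MB text file"
    du -sh .git                        # a few hundred KB: the blob is zlib-deflated

    sed -i '1s/^/changed /' large.txt  # small change on the first line (GNU sed)
    git add large.txt && git commit -m "change first line"
    du -sh .git/objects/??             # the two large blobs stand out among
                                       # the small tree and commit objects

    git gc                             # pack the loose objects, delta-compressing them
    du -sh .git/objects/pack           # a single, much smaller pack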

So, at least under some circumstances, git will keep entire copies of large files for a while.

Andreas Wederbrand