According to https://opensource.com/life/16/8/how-manage-binary-blobs-git-part-7:

One thing everyone seems to agree on is Git is not great for big binary blobs. Keep in mind that a binary blob is different from a large text file; you can use Git on large text files without a problem, but Git can't do much with an impervious binary file except treat it as one big solid black box and commit it as-is.

Say you have a complex 3D model for the exciting new first person puzzle game you're making, and you save it in a binary format, resulting in a 1 gigabyte file. You git commit it once, adding a gigabyte to your repository's history. Later, you give the model a different hair style and commit your update; Git can't tell the hair apart from the head or the rest of the model, so you've just committed another gigabyte. Then you change the model's eye color and commit that small change: another gigabyte. That is three gigabytes for one model with a few minor changes made on a whim. Scale that across all the assets in a game, and you have a serious problem.

It was my understanding that there is no difference between text and binary files, and that Git stores every file of each commit in its entirety (creating a checksummed blob), with unchanged files simply pointing to an already existing blob. How all those blobs are stored and compressed is another question whose details I don't know, but I would have assumed that if the various 1GB files in the quote are more or less the same, a good compression algorithm would figure this out and might be able to store all of them in even less than 1GB total, if they are repetitive. This reasoning should apply to binary as well as to text files.
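
A quick way to sanity-check that assumption is a throwaway repository (a minimal sketch; the file name is made up):

git init /tmp/blob-test && cd /tmp/blob-test
echo "version 1" > model.obj
git add model.obj && git commit -m "v1"
git rev-parse HEAD:model.obj      # blob ID of version 1
echo "version 2" > model.obj
git add model.obj && git commit -m "v2"
git rev-parse HEAD:model.obj      # a different blob ID: a new, complete object
git rev-parse HEAD~1:model.obj    # the first blob is still there, unchanged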

Contrary to this, the quote continues:

Contrast that to a text file like the .obj format. One commit stores everything, just as with the other model, but an .obj file is a series of lines of plain text describing the vertices of a model. If you modify the model and save it back out to .obj, Git can read the two files line by line, create a diff of the changes, and process a fairly small commit. The more refined the model becomes, the smaller the commits get, and it's a standard Git use case. It is a big file, but it uses a kind of overlay or sparse storage method to build a complete picture of the current state of your data.

Is my understanding correct? Is the quote incorrect?

Bananach
  • I think the main problem here is that if you (and I don't *literally* mean *you* as in the person asking this question) are committing multi-megabyte files using git, you will eventually have problems anyway. It's just less of a problem in *good, structured, source code* because you wouldn't have such files to begin with. – Lasse V. Karlsen Dec 06 '18 at 16:17
  • Now, having said that, once you introduce packfiles into the mix, and at least don't commit already compressed files, then the "change the hair color" change would likely not end up with a pack file that contains two complete copies of this file, but instead *might* end up with a delta-compressed second version, meaning that the file would literally be "the old one, just with a hair change" in terms of stored data. – Lasse V. Karlsen Dec 06 '18 at 16:19

2 Answers


Git does store files in their entirety, so if you have two versions of a binary file that differ by only a small change, they will take twice the space. Observe.

% git init                
Initialized empty Git repository in /tmp/x/.git/
{master #}% du -sh .git
100K    .git
{master #}% dd if=/dev/urandom of=./test count=1 bs=10M
1+0 records in
1+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.102277 s, 103 MB/s
{master #%}% ls -sh test
10M test
{master #%}% git add test
{master #}% git commit -m "Adds test"
[master (root-commit) 0c12c32] Adds test
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 test
{master}% du -sh .git
11M     .git

I've created a 10MB file and added and committed it. The repository is now 10MB in size.

If I make a small change and then do this again,

{master}% e test # This is an invocation of my editor to change a few bytes.
nil
{master}% git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   test

no changes added to commit (use "git add" and/or "git commit -a")
{master *}% git add test
{master +}% git commit -m "Updates test a little"
[master 99ed99a] Updates test a little
 1 file changed, 0 insertions(+), 0 deletions(-)
{master}% du -sh .git
21M     .git

The repository now takes about 20MB: two full copies of the 10MB file.

This, however, is the "loose object" format of the repository, where each blob is a separate file on disk.

You can pack all of these into a git packfile (which is done when you push etc.) and see what happens.

{master}% git gc
Counting objects: 6, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0)
{master}% du -sh .git
11M     .git

Now the packfile stores one full copy of the blob plus a small delta for the second version. This is different from each commit storing just a diff; it's the objects themselves that get delta-compressed when they are packed together into a single file.
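
If you want to see exactly which object was stored as a delta, you can ask the packfile itself (the object IDs in the output will of course be specific to your repository):

{master}% git verify-pack -v .git/objects/pack/pack-*.idx

The verbose listing shows each object's SHA-1, type, size and size-in-pack; delta-compressed objects additionally show their delta depth and base object, so the second 10MB blob should appear as a small delta against the first rather than as another full copy.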

Noufal Ibrahim

You're right in that text and binary files are really just blob objects. If that were all there was to the story, things would be simpler, but it isn't, so they aren't. :-)
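
You can confirm that yourself; with made-up file names, something like:

git cat-file -t HEAD:README.md    # blob
git cat-file -t HEAD:model.bin    # blob as well: the object type records no text-vs-binary distinction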

(You can also instruct Git to perform various filtering operations on input files. Here again, there's no difference between text and binary files in terms of what the filters do, but there is a difference in terms of when filters are applied by default: If you use the automatic mode, Git will filter a file that Git thinks is text, and not-filter a file that Git thinks is binary. But that only matters if you use the automatic detection and CRLF / LF-only line ending conversions.)
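
As a concrete (and hypothetical) illustration of that automatic mode, you could declare attributes like these and then ask Git what it decided for each path:

printf '%s\n' '* text=auto' '*.obj text' '*.blend binary' > .gitattributes
git check-attr text -- model.obj scene.blend
# model.obj: text: set       -> eligible for CRLF / LF conversion
# scene.blend: text: unset   -> "binary" switches the text filtering off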

I would have assumed that if the various 1GB files in the quote are more or less the same, a good compression algorithm would figure this out and may be able to store all of them in even less than 1GB total, if they are repetitive ...

Maybe, and maybe not. Git has two separate compression algorithms. As Noufal Ibrahim said, one of these two—delta compression—is applied only in what Git calls pack files. The other one is zlib, which is applied to everything.
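
To see the zlib layer concretely, here is a sketch that assumes the blob in question is still a loose object (not yet packed) and that a file called data.bin exists in HEAD:

sha=$(git rev-parse HEAD:data.bin)        # blob ID of the committed file (hypothetical name)
ls -l .git/objects/${sha:0:2}/${sha:2}    # on-disk size of the zlib-deflated loose object
git cat-file -s "$sha"                    # size of the same blob after inflation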

Zlib is a general compression algorithm and relies on a particular modeling process (see Is there an algorithm for "perfect" compression? for background). It tends to perform pretty well on plain text, and not so well on some binaries. It tends to make already-compressed files bigger, so if your 1 GB inputs are already-compressed, they are likely to be (marginally) larger after zlib compression. But all of these are generalities; to find out how it works on your specific data, the trick is to run it on your specific data.
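
A rough way to get a feel for this on your own machine (gzip uses the same DEFLATE algorithm as zlib, so it is a reasonable stand-in; the file names are made up):

dd if=/dev/urandom of=random.bin bs=1M count=10   # stands in for already-compressed / high-entropy data
seq 1 1000000 > numbers.txt                       # highly repetitive plain text
gzip -k random.bin numbers.txt                    # -k keeps the originals around
ls -lh random.bin random.bin.gz numbers.txt numbers.txt.gz

The text file shrinks substantially, while the random data stays the same size or grows a little, which is roughly what happens to your blobs when Git zlib-deflates them.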

The delta encoding that Git uses happens "before" zlib compression, and does work with binary data. Essentially, it finds long binary sequences of bytes that match in an "earlier" and "later" object (with "earlier" and "later" being rather loosely defined here, but Git imposes a particular walk and compare order on the objects for reasons discussed here) and if possible, replaces some long sequence of N bytes with "referring to earlier object, grab N bytes from offset O".

If you try this on large binary files, it turns out that it generally works pretty well on pairs of related, large, uncompressed binary files that have some kind of data locality, as the "later" binary file tends to have a lot of long repeats of the "earlier" file, and very badly on large compressed binary files, or binary files that represent data structures that get shuffled about too much (so that the repeated binary strings have become very fragmented, i.e., none are long any more). So once again, it's quite data-dependent: try it on your specific data to see if it works well for you.
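
A throwaway experiment along those lines (the source paths are placeholders for two versions of your own file):

git init /tmp/delta-test && cd /tmp/delta-test
cp /path/to/model-v1.bin data.bin && git add data.bin && git commit -m "v1"
cp /path/to/model-v2.bin data.bin && git add data.bin && git commit -m "v2"
git repack -adf        # force everything into one freshly delta-compressed packfile
git count-objects -v   # size-pack is the size of the packed history, in KiB

If size-pack comes out close to the size of a single version, delta compression is working well on your data; if it is close to the sum of both versions, it is not.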

torek