
I'm planning to move our repository from SVN to Git, and I've heard a lot about Git being inefficient with binary files. But beyond repository size, I don't really understand what issues I would actually face, since we do have a lot of binary files in our repository.

This is our scenario: We have a single repository of 800MB that contains 2 directories:

  • src (300MB)
  • libs (500MB of binary files)

This is the current size, not counting history (let's assume we start the Git repo from scratch).

The binary files never exceed 25MB, most of them are smaller than 10MB, and they rarely change (2 or 3 times a year).

Can I expect issues with a repository like this when using Git? If the only issue is that all the history is kept in each local repository, then I don't expect it to grow much, since these files rarely change.

But might Git's performance (when committing or checking the status) be affected by the fact that I have a lot of binary files in the repository? Could the Git subtree feature help here (by making the "libs" directory a subtree of the main repository)?
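To illustrate what I mean by that last question, a subtree setup might look roughly like this (the repository URL and branch names are made up):

```
# extract the history of libs/ into its own branch and publish it separately
git subtree split --prefix=libs -b libs-only
git push git@example.com:libs.git libs-only:master

# or pull an external libs repository into this one under libs/
git subtree add --prefix=libs git@example.com:libs.git master --squash
```

(As far as I know, `git subtree` ships as a contrib script, so it may need to be installed separately.)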

EDIT: I know I could use something like Maven to store these binaries outside the repository; however, we have a restriction that we must keep these files together.

UPDATE: I ran a series of tests and concluded that Git is smart enough to compute deltas even for zip content: for instance, if I add a 20MB zip file and then modify one text file inside the zip, when I commit the new version of the zip and run `git gc`, the repository size is almost unchanged (still about 20MB). So I can assume Git works fine with zip files. Can someone confirm this?
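For reference, this is roughly the test I ran; the paths and file names below are made up:

```
git init zip-test && cd zip-test
mkdir libs && cp /path/to/archive.zip libs/    # a ~20MB zip file
git add libs && git commit -m "add archive"
du -sh .git                                    # baseline repository size

# change one small text file inside the zip, then commit the new version
echo "small change" >> note.txt
zip libs/archive.zip note.txt                  # updates note.txt inside the archive
git add libs && git commit -m "modify one file inside the zip"
git gc                                         # repack; deltas are computed here
du -sh .git                                    # almost unchanged in my tests
```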

xsilmarx

2 Answers


The main issue you might run into is that every Git repository stores the complete history of all files. Even when they are packed together, there is no easy way to make a "light" checkout of only the one subdirectory with the source files you need to work on.

If you have 500 MB of binary files that change 2-3 times a year, then after three years you'll be handling 3+ GB of history (OK, compressed a bit) whenever you clone or store the repository. This may get a bit irritating.
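If you want to keep an eye on that growth, a rough sketch of the commands I'd reach for (with a placeholder URL):

```
git count-objects -vH        # total size of loose and packed objects
git clone --depth 1 <url>    # shallow clone: fetch only the latest snapshot
```

A shallow clone only mitigates the initial download, though; the central repository still carries the full history.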

In my experience, git submodules are not a tremendous help in this regard: you still have a Git repo with the files (i.e. a big and growing repository), and submodules mostly complicate things. The best approach is to try to avoid large binaries, for example by storing the sources you use to build them (and perhaps caching the built artifacts somewhere if the build takes too long).

Nevertheless, Git will definitely survive your use case, so if you don't mind a bit of disk space, give it a shot.

che
  • As I stated in the update to my question, I ran a series of tests and Git seems to handle zipped files such as jars pretty well. When I say I have 500MB of binaries, most of them won't even change for a long time, so I think Git can handle it and it should not grow much. I also agree that the ones that change often should be stored outside and fetched during the build process. – xsilmarx Apr 22 '15 at 08:07

The main reason you see a difference in size between Git and SVN is that git and svn aren't built the same way.

SVN: To handle files, svn uses deltas. I.e., the first time you commit a file, svn stores the full file, and when you commit a modification, svn stores only the differences between the two versions. If I remember correctly (and to be precise), svn keeps the full copy of the last version you committed and stores the deltas backwards (reverse deltas). This is pretty quick when you have few revisions and when you want the HEAD revision, but the more revisions you have, the slower it gets to fetch a specific old revision, since svn has to rebuild the file by applying the deltas.
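As a rough illustration (the server URL is made up), getting HEAD is cheap, while an old revision forces svn to walk back through the reverse deltas:

```
svn checkout https://svn.example.com/repo/trunk   # HEAD is stored in full
cd trunk
svn cat -r 1 somefile.bin > old.bin               # r1 is rebuilt by applying
                                                  # reverse deltas from HEAD
```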

Git: Git works in a completely different way from svn. It doesn't store deltas; it stores blobs (binary large objects). When you commit a file, git stores its content in a blob referenced from that revision. If you commit without modifying the file, git simply reuses the blob from the previous commit (objects are addressed by the hash of their content). If you modify the file, git stores a full new blob. This has the advantage of being equally fast for every revision, but your repository can grow quite quickly.
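You can check the blob reuse yourself with a quick sketch like this (the file name is arbitrary):

```
git init blob-demo && cd blob-demo
echo "hello" > foo.txt
git add foo.txt && git commit -m "v1"
git rev-parse HEAD:foo.txt        # hash of the blob holding foo.txt
git commit --allow-empty -m "v2"  # commit with no changes
git rev-parse HEAD:foo.txt        # same hash: the existing blob is reused
echo "world" >> foo.txt
git add foo.txt && git commit -m "v3"
git rev-parse HEAD:foo.txt        # new hash: a full new blob (until repacked)
```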

I won't go into how to deal with binaries, because I believe this is well covered on the internet (and I'm sure it is on SO).

I hope this helps.

Unex
  • When git packs files together, it does compute deltas and stores them in a more compact form (not a copy of each version). – ComputerDruid Apr 20 '15 at 16:30
  • Hmm, I wasn't aware of that, I'll dig into it – Unex Apr 21 '15 at 08:02
  • @ComputerDruid It only does that sometimes. Try it yourself by creating an empty git repository, and adding a large text file (say 100MB). Then `du -hd 0 .git`, to see the size of the repo. Then, add a single line anywhere in the file. Then run `du -hd 0 .git` again, and see the size double. – texasflood Jun 20 '15 at 14:10