550

I am looking for opinions on how to handle large binary files on which my source code (a web application) depends. We are currently discussing several alternatives:

  1. Copy the binary files by hand.
    • Pro: Not sure.
    • Contra: I am strongly against this, as it increases the likelihood of errors when setting up a new site or migrating the old one. It also adds another hurdle to clear.
  2. Manage them all with Git.
    • Pro: Removes the possibility of 'forgetting' to copy an important file.
    • Contra: Bloats the repository and decreases flexibility in managing the code base; checkouts, clones, etc. will take quite a while.
  3. Separate repositories.
    • Pro: Checking out/cloning the source code is fast as ever, and the images are properly archived in their own repository.
    • Contra: Removes the simplicity of having one and only one Git repository for the project. It surely introduces some other issues I haven't thought about.

What are your experiences/thoughts regarding this?

Also: Does anybody have experience with multiple Git repositories and managing them in one project?

The files are images for a program which generates PDFs containing those files. The files will not change very often (as in years), but they are very relevant to the program. The program will not work without them.

Peter Mortensen
pi.
  • 29
    What about when version controlling the binary file is necessary? I'm thinking for teams of artists working on assets. – Dan Jun 23 '10 at 16:56
  • 3
    If it is necessary then you have to balance your available resources (disk, bandwidth, CPU time) against the benefit you get. – pi. Jun 28 '10 at 15:20
  • 4
    Note that without file-locking, git isn't great when multiple people need to work on the same binary file. – yoyo Mar 05 '12 at 20:35
  • 1
    See also the [git-based backup tool bup](http://stackoverflow.com/a/19494211/6309). – VonC Oct 21 '13 at 13:58
  • Link to screencast is broken. Seems gitcasts.com is down/gone. – doughgle Dec 11 '13 at 03:15
  • 1
    Here they are http://www.bestechvideos.com/tag/gitcasts – doughgle Dec 11 '13 at 03:19
  • @doughgle The site you posted contains only links to a gitcasts.com subdomain which no longer exists. – Rafael Bugajewski Mar 10 '14 at 17:34
  • 1
    You now have the GitHub LFS solution since April 2015: see [my answer below](http://stackoverflow.com/a/29530784/6309) – VonC Apr 09 '15 at 05:54
  • It is possible to store large binary files in a single git repository without bloating the repository, with efficient checkouts and with a workaround for inefficient clones [just have a look at my answer](http://stackoverflow.com/questions/540535/managing-large-binary-files-with-git/31390846#31390846). – Adam Kurkiewicz Jul 13 '15 at 21:38

13 Answers

317

I discovered git-annex recently which I find awesome. It was designed for managing large files efficiently. I use it for my photo/music (etc.) collections. The development of git-annex is very active. The content of the files can be removed from the Git repository, only the tree hierarchy is tracked by Git (through symlinks). However, to get the content of the file, a second step is necessary after pulling/pushing, e.g.:

$ git annex add mybigfile
$ git commit -m'add mybigfile'
$ git push myremote
$ git annex copy --to myremote mybigfile ## This command copies the actual content to myremote
$ git annex drop mybigfile ## Remove content from local repo
...
$ git annex get mybigfile ## Retrieve the content
## or to specify the remote from which to get:
$ git annex copy --from myremote mybigfile

There are many commands available, and there is great documentation on the website. A package is available for Debian.
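For day-to-day use, a possible shorthand (an assumption on my part, for reasonably recent git-annex versions) is to let git annex sync handle both the Git metadata and the annexed content in one step:

$ git annex add mybigfile
$ git commit -m 'add mybigfile'
$ git annex sync --content ## Push/pull the Git branches and transfer annexed content to/from remotes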

Peter Mortensen
rafak
  • 13
    Whoa! Upvote for awesomeness! This implements an idea that I had recently, and much more. It's written in Haskell no less. git-media is a good alternative, by the way. – cdunn2001 Jul 20 '11 at 17:13
  • Submodules are confusing and it's easy to "get lost" using them. Annex seems like a much better solution to this problem. – kojiro Dec 21 '11 at 15:18
  • 36
    But, Annex does not support Windows. Which is problematic for game developers. – A.A. Grapsas Jul 18 '12 at 21:12
  • 8
    I heard Steam is dropping support for windows, and adding support for Linux... ;) seriously though, how hard can it be to port this? I guess your average game developer could do it. – Sam Watkins Jul 28 '12 at 15:04
  • @SamWatkins it uses symlinks so it's not that easy to port. – Bite code Nov 11 '12 at 16:02
  • 4
    @e-satis Windows has symlinks. But on closer inspection it seems there's a limit to how many symlinks you can have per path though. But there are symlinks in windows: http://msdn.microsoft.com/en-us/library/windows/desktop/aa365680(v=vs.85).aspx – Esteban Brenes Jan 11 '13 at 20:21
  • 7
    @EstebanBrenes The real deal-breaker is that in the normal configuration Windows symlinks require elevated privileges to create. – Laurens Holst Feb 12 '13 at 15:06
  • 2
    Git-annex itself has some documentation about the obstacles to porting to Windows: http://git-annex.branchable.com/todo/windows_support/ – dbn Feb 26 '13 at 22:00
  • 1
    Some other reasons to not use annex: Does not support synchronization of binary file changes by multiple users. Not designed to keep long term version history of files [history is there, ability to revert is not really] – iheanyi Mar 11 '14 at 22:26
  • @iheanyi: the ability to revert is totally there; it only requires the user to decide how to handle it (less automatic than regular git content). If you want synchronisation of deleted files (i.e. old versions of binary files), you can for example have a special "deleted" directory in your project with "copies" (i.e. symlinks) of all annexed files, so that they never appear as unused to git-annex and thus are correctly synchronized. Another solution is to rename old versions a la emacs: .file~1, .file~2, ..., or any other scheme. This is made easy by `git annex unlock` making copies of content. – rafak Jul 06 '14 at 11:38
  • 7
    I just found [this page](https://git-annex.branchable.com/install/Windows/). It reads that now `git annex` is available on **Windows** as well. If anyone has ever tested it in Windows, I'd like to hear about his or her experience! – Kouichi C. Nakamura Mar 05 '15 at 03:47
  • how would it work if a file has to be updated in the working tree? For example, I have a pdf file produced by latex, how would that work since git-annex actually works with symlinks? – PlasmaBinturong Dec 08 '16 at 20:44
179

If the program won't work without the files it seems like splitting them into a separate repo is a bad idea. We have large test suites that we break into a separate repo but those are truly "auxiliary" files.

However, you may be able to manage the files in a separate repo and then use git-submodule to pull them into your project in a sane way. So, you'd still have the full history of all your source but, as I understand it, you'd only have the one relevant revision of your images submodule. The git-submodule facility should help you keep the correct version of the code in line with the correct version of the images.

Here's a good introduction to submodules from the Git Book.
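A minimal sketch of that setup (the URLs and paths are placeholders, not from the question):

git submodule add https://example.com/myproject-assets.git assets/images
git commit -m "Track PDF images as a submodule"

# A colleague setting up a fresh checkout then runs:
git clone https://example.com/myproject.git
cd myproject
git submodule update --init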

Charles Beattie
Pat Notz
  • 12
    "as I understand it, you'd only have the one relevant revision of your images submodule." I don't think this is correct. – Robin Green Nov 12 '11 at 07:30
  • 29
    Indeed. A submodule is a full Git repository, which just happens to be nested inside the parent repository. It knows its entire history. You could commit less frequently in it, but if you store the same things in it you would have in the parent, it will have the same issues the parent would have. – Cascabel Feb 16 '12 at 21:03
  • 6
    This is a pretty poor solution if you have large binary files that are changing at some regular interval. We have a repository that's horribly bloated because a new binary file gets stored in it with every build. If you're not on Windows, as mentioned below, Annex is a good solution. If you are on Windows... will just have to keep looking. – A.A. Grapsas Jul 18 '12 at 21:13
  • 5
    Another problem in having large binary files in the repo is performance. Git wasn’t designed to cope with large binary files and once the repo size climbs to 3G+, the performance quickly drops. This means that having large binaries in the repo limits your hosting options. – zoul Oct 12 '12 at 07:09
  • 1
    Submodules can reduce checkout data transfer requirements if you creatively misuse the submodule: when you want to update the submodule contents, create a new commit without a parent and then point superproject (main git repo) to the newly created commit without a parent. Logically this creates a disconnected history for the submodule but in return, any version of the submodule is easier to transfer because that version has no history. – Mikko Rantalainen Sep 02 '13 at 09:42
  • 1
    Annex is no good if you need to have lasting version history of your binary files. If you can live with only having the latest version available . . . that could work but then you could also use something else so long as you added a hook for it. Also, binary files can't be merged so annex would be useless should you need to synchronize changes by multiple users. – iheanyi Mar 11 '14 at 22:24
  • As said by @A.A.Grapsas, Apart from being a bad solution, it's worth adding that git submodules are terribly unintuitive, and [horribly scary](https://codingkilledthecat.wordpress.com/2012/04/28/why-your-company-shouldnt-use-git-submodules/). If you really want to have a pure git solution, have a look at [my answer](http://stackoverflow.com/questions/540535/managing-large-binary-files-with-git/31390846#31390846) – Adam Kurkiewicz Jul 13 '15 at 21:47
  • 1
    You could always shallow clone the sub-module, ignoring all of its history (*q.v.*, `--depth`) – Rich Remer May 28 '16 at 02:47
60

Another solution, since April 2015 is Git Large File Storage (LFS) (by GitHub).

It uses git-lfs (see git-lfs.github.com) and was tested with a server supporting it: lfs-test-server.
You can store only the metadata in the Git repo, and the large files elsewhere.

(Animated demo of the git-lfs workflow: https://cloud.githubusercontent.com/assets/1319791/7051226/c4570828-ddf4-11e4-87eb-8fc165e5ece4.gif)
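A minimal sketch of the client-side workflow (the tracked pattern and file name are just examples):

git lfs install                 # set up the LFS filters once per machine
git lfs track "*.psd"           # example pattern; writes a rule to .gitattributes
git add .gitattributes
git add design.psd
git commit -m "Add design file via LFS"
git push origin master          # the pointer goes to Git, the blob to the LFS server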

VonC
  • 3
    `lfs-test-server` is declared to be not for production use. Actually, I'm working on production LFS server (https://github.com/artemkin/git-lfs-server). It is in progress, but already serviceable, and we're testing it in-house. – Stas Apr 26 '15 at 22:30
  • Can you checkout previous versions of such binary file using git lfs? – mucaho Mar 23 '16 at 02:59
  • 1
    @mucaho You should: the syntax of git checkout is unchanged and the lfs smudge script should still be called. – VonC Mar 23 '16 at 07:41
34

Have a look at git bup, which is a Git extension to smartly store large binaries in a Git repository.

You'd want to have it as a submodule, but you won't have to worry about the repository getting hard to handle. One of their sample use cases is storing VM images in Git.

I haven't actually seen better compression rates, but my repositories don't have really large binaries in them.

Your mileage may vary.
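For reference, a rough sketch of a basic bup session (an illustration on my part; it assumes bup is installed and uses its default repository in ~/.bup):

bup init                                 # create the bup repository
bup index /path/to/big-assets            # index the files to store
bup save -n assets /path/to/big-assets   # save a snapshot under the branch name "assets"
bup ls assets/latest                     # inspect the most recent snapshot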

Peter Mortensen
sehe
  • 3
    bup provides storage (internally using parity archives for redundancy and git for compression, dedup and history), but it doesn't extend git. git-annex is a git extension that provides [a bup storage backend](http://git-annex.branchable.com/walkthrough/using_bup/). – Tobu Feb 21 '12 at 11:55
  • @Tobu when I posted this, git annex didn't yet exist (in mainstream releases) – sehe Feb 21 '12 at 12:00
  • 2
    bup is definitely interesting for managing large files. I wanted to point out a difference in UI: you use bup commands outside of any repository context, and git is an implementation detail. – Tobu Feb 21 '12 at 12:07
29

You can also use git-fat. I like that it only depends on stock Python and rsync. It also supports the usual Git workflow, with the following self-explanatory commands:

git fat init
git fat push
git fat pull

In addition, you need to check a .gitfat file into your repository and modify your .gitattributes to specify the file extensions you want git fat to manage.

You add a binary using the normal git add, which in turn invokes git fat based on your gitattributes rules.

Finally, it has the advantage that the location where your binaries are actually stored can be shared across repositories and users and supports anything rsync does.

UPDATE: Do not use git-fat if you're using a Git-SVN bridge. It will end up removing the binary files from your Subversion repository. However, if you're using a pure Git repository, it works beautifully.
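To illustrate the two configuration files mentioned above, a rough sketch (the host, path and patterns are placeholders, and the exact syntax should be checked against the git-fat README):

# .gitfat -- where the binary payloads are rsynced to/from
[rsync]
remote = storage.example.com:/srv/git-fat-store

# .gitattributes -- which files git fat should manage
*.png filter=fat -crlf
*.zip filter=fat -crlf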

Peter Mortensen
Carl
26

I would use submodules (as Pat Notz suggests) or two distinct repositories. If you modify your binary files too often, then I would try to minimize the impact of the huge repository by cleaning its history:

I had a very similar problem several months ago: ~21 GB of MP3 files, unclassified (bad names, bad id3's, don't know if I like that MP3 file or not...), and replicated on three computers.

I used an external hard disk drive with the main Git repository, and I cloned it onto each computer. Then, I started to classify them in the usual way (pushing, pulling, merging... deleting and renaming many times).

At the end, I had only ~6 GB of MP3 files and ~83 GB in the .git directory. I used git-write-tree and git-commit-tree to create a new commit, without commit ancestors, and started a new branch pointing to that commit. The "git log" for that branch only showed one commit.

Then, I deleted the old branch, kept only the new branch, deleted the reflogs, and ran "git prune": after that, my .git folders weighed only ~6 GB...

You could "purge" the huge repository from time to time in the same way: Your "git clone"'s will be faster.

Peter Mortensen
Daniel Fanjul
  • I did something similar once where I had to split one repository which I merged accidentally into two distinct ones. Interesting usage pattern though. :) – pi. Feb 12 '09 at 15:04
  • 1
    Would this be the same as just: rm -f .git; git init; git add . ; git commit -m "Trash the history." – Pat Notz Feb 12 '09 at 22:21
  • 1
    Yes, it is the same only in my mp3 case. But sometimes you don't want to touch your branches and tags (no space reduction in public repositories) but you want to speed up a "git clone/fetch/pull" of only a branch (less space for dedicated-to-that-branch repositories). – Daniel Fanjul Feb 13 '09 at 12:50
15

The solution I'd like to propose is based on orphan branches and a slight abuse of the tag mechanism, henceforth referred to as Orphan Tags Binary Storage (OTABS).

TL;DR 12-01-2017 If you can use github's LFS or some other 3rd party, by all means you should. If you can't, then read on. Be warned, this solution is a hack and should be treated as such.

Desirable properties of OTABS

  • it is a pure git and git only solution -- it gets the job done without any 3rd party software (like git-annex) or 3rd party infrastructure (like github's LFS).
  • it stores the binary files efficiently, i.e. it doesn't bloat the history of your repository.
  • git pull and git fetch, including git fetch --all are still bandwidth efficient, i.e. not all large binaries are pulled from the remote by default.
  • it works on Windows.
  • it stores everything in a single git repository.
  • it allows for deletion of outdated binaries (unlike bup).

Undesirable properties of OTABS

  • it makes git clone potentially inefficient (but not necessarily, depending on your usage). If you deploy this solution you might have to advise your colleagues to use git clone -b master --single-branch <url> instead of git clone. This is because git clone by default literally clones the entire repository, including things you wouldn't normally want to waste your bandwidth on, like unreferenced commits. Taken from SO 4811434.
  • it makes git fetch <remote> --tags bandwidth inefficient, but not necessarily storage inefficient. You can always advise your colleagues not to use it.
  • you'll have to periodically use a git gc trick to clean your repository from any files you don't want any more.
  • it is not as efficient as bup or git-bigfiles. But it's respectively more suitable for what you're trying to do and more off-the-shelf. You are likely to run into trouble with hundreds of thousands of small files or with files in the range of gigabytes, but read on for workarounds.

Adding the Binary Files

Before you start, make sure that you've committed all your changes, your working tree is up to date and your index doesn't contain any uncommitted changes. It might be a good idea to push all your local branches to your remote (github etc.) in case any disaster should happen. The whole sequence is collected in a consolidated sketch after the list below.

  1. Create a new orphan branch. git checkout --orphan binaryStuff will do the trick. This produces a branch that is entirely disconnected from any other branch, and the first commit you'll make in this branch will have no parent, which will make it a root commit.
  2. Clean your index using git rm --cached * .gitignore.
  3. Take a deep breath and delete the entire working tree using rm -fr * .gitignore. The internal .git directory will stay untouched, because the * wildcard doesn't match it.
  4. Copy in your VeryBigBinary.exe, or your VeryHeavyDirectory/.
  5. Add it && commit it.
  6. Now it becomes tricky -- if you push it into the remote as a branch, all your developers will download it the next time they invoke git fetch, clogging their connection. You can avoid this by pushing a tag instead of a branch. This can still impact your colleagues' bandwidth and filesystem storage if they have a habit of typing git fetch <remote> --tags, but read on for a workaround. Go ahead and git tag 1.0.0bin
  7. Push your orphan tag git push <remote> 1.0.0bin.
  8. Just so you never push your binary branch by accident, you can delete it: git branch -D binaryStuff. Your commit will not be marked for garbage collection, because the orphan tag 1.0.0bin pointing at it is enough to keep it alive.
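The consolidated sketch mentioned above (names, as in the steps, are illustrative):

git checkout --orphan binaryStuff        # 1. parentless branch
git rm --cached * .gitignore             # 2. clear the index
rm -fr * .gitignore                      # 3. clear the working tree (.git survives)
cp /somewhere/VeryBigBinary.exe .        # 4. bring in the binary
git add VeryBigBinary.exe                # 5. add it...
git commit -m "Add VeryBigBinary.exe"    #    ...and commit it
git tag 1.0.0bin                         # 6. tag the commit instead of keeping a branch
git push <remote> 1.0.0bin               # 7. push the tag only
git checkout master                      #    go back to your normal branch
git branch -D binaryStuff                # 8. safe: the tag keeps the commit reachable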

Checking out the Binary File

  1. How do I (or my colleagues) get the VeryBigBinary.exe checked out into the current working tree? If your current working branch is for example master you can simply git checkout 1.0.0bin -- VeryBigBinary.exe.
  2. This will fail if you don't have the orphan tag 1.0.0bin downloaded, in which case you'll have to git fetch <remote> 1.0.0bin beforehand.
  3. You can add the VeryBigBinary.exe into your master's .gitignore, so that no-one on your team will pollute the main history of the project with the binary by accident.

Completely Deleting the Binary File

If you decide to completely purge VeryBigBinary.exe from your local repository, your remote repository and your colleague's repositories you can just:

  1. Delete the orphan tag on the remote git push <remote> :refs/tags/1.0.0bin
  2. Delete the orphan tag locally (deletes all other unreferenced tags) git tag -l | xargs git tag -d && git fetch --tags. Taken from SO 1841341 with slight modification.
  3. Use a git gc trick to delete your now unreferenced commit locally. git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 -c gc.rerereunresolved=0 -c gc.pruneExpire=now gc "$@". It will also delete all other unreferenced commits. Taken from SO 1904860
  4. If possible, repeat the git gc trick on the remote. It is possible if you're self-hosting your repository and might not be possible with some git providers, like github or in some corporate environments. If you're hosting with a provider that doesn't give you ssh access to the remote, just let it be. It is possible that your provider's infrastructure will clean your unreferenced commit in their own sweet time. If you're in a corporate environment you can advise your IT to run a cron job garbage collecting your remote once per week or so. Whether they do or don't will not have any impact on your team in terms of bandwidth and storage, as long as you advise your colleagues to always git clone -b master --single-branch <url> instead of git clone.
  5. All your colleagues who want to get rid of outdated orphan tags need only to apply steps 2-3.
  6. You can then repeat steps 1-8 of Adding the Binary Files to create a new orphan tag 2.0.0bin. If you're worried about your colleagues typing git fetch <remote> --tags you can actually name it 1.0.0bin again. This will make sure that the next time they fetch all the tags the old 1.0.0bin will be unreferenced and marked for subsequent garbage collection (using step 3). When you try to overwrite a tag on the remote you have to use -f, like this: git push -f <remote> <tagname>

Afterword

  • OTABS doesn't touch your master or any other source code/development branches. The commit hashes, all of the history, and the small size of these branches are unaffected. If you've already bloated your source code history with binary files you'll have to clean it up as a separate piece of work. This script might be useful.

  • Confirmed to work on Windows with git-bash.

  • It is a good idea to apply a set of standard tricks to make storage of binary files more efficient. Frequent running of git gc (without any additional arguments) makes git optimise the underlying storage of your files by using binary deltas. However, if your files are unlikely to stay similar from commit to commit you can switch off binary deltas altogether. Additionally, because it makes no sense to compress already compressed or encrypted files, like .zip, .jpg or .crypt, git allows you to switch off compression of the underlying storage. Unfortunately it's an all-or-nothing setting affecting your source code as well.

  • You might want to script up parts of OTABS to allow for quicker usage. In particular, scripting steps 2-3 from Completely Deleting Binary Files into an update git hook could give compelling but perhaps dangerous semantics to git fetch ("fetch and delete everything that is out of date").

  • You might want to skip the step 4 of Completely Deleting Binary Files to keep a full history of all binary changes on the remote at the cost of the central repository bloat. Local repositories will stay lean over time.

  • In the Java world it is possible to combine this solution with maven --offline to create a reproducible offline build stored entirely in your version control (it's easier with maven than with gradle). In the Go world it is feasible to build on this solution to manage your GOPATH instead of go get. In the Python world it is possible to combine this with virtualenv to produce a self-contained development environment without relying on PyPI servers for every build from scratch.

  • If your binary files change very often, like build artifacts, it might be a good idea to script a solution which stores the 5 most recent versions of the artifacts in the orphan tags monday_bin, tuesday_bin, ..., friday_bin, and also an orphan tag for each release 1.7.8bin, 2.0.0bin, etc. You can rotate the weekday_bin tags and delete old binaries daily. This way you get the best of both worlds: you keep the entire history of your source code but only the relevant history of your binary dependencies. It is also very easy to get the binary files for a given tag without getting the entire source code with all its history: git init && git remote add <name> <url> && git fetch <name> <tag> should do it for you.

Adam Kurkiewicz
  • "You have to periodically use `git gc`" — stopped reading right there. Why would anyone give up their last safety belt in favor of some hack? – user1643723 Sep 16 '16 at 10:23
  • @user1643723 `git gc` is not unsafe to run. All your dangling commits will be safely kept on the hard drive for at least 30 days by default: https://git-scm.com/docs/git-gc – Adam Kurkiewicz Sep 22 '16 at 08:46
  • Thanks for the detailed writeup. I wanted to try this as a way to store some binary dependencies in my GitHub repo in such a way that they are not downloaded by default when someone clones the repo, but can be downloaded manually & update the local repo. However, I got an error at this step: `git push 1.0.0bin` - `remote: error: GH001: Large files detected. You may want to try Git Large File Storage`. It looks like perhaps GitHub is no longer supporting this? The binary in question was 100MB in size. – user5359531 Jan 12 '17 at 19:16
  • 2
    To be completely honest, if you are allowed to use github for your work, what keeps you from using LFS? The guys at github have worked hard to create this product, and they're even hosting it for you and their infrastructure is optimised around using it. This hack is meant for situations when you really can't use LFS or other third-parties and you're after a pure-git solution. – Adam Kurkiewicz Jan 12 '17 at 19:23
  • I've also updated the answer to be more clear about how hacky this solution actually is. – Adam Kurkiewicz Jan 12 '17 at 19:27
13

In my opinion, if you're likely to often modify those large files, or if you intend to make a lot of git clone or git checkout, then you should seriously consider using another Git repository (or maybe another way to access those files).

But if you work like we do, and if your binary files are not often modified, then the first clone/checkout will be long, but after that it should be as fast as you want (assuming your users keep using the first repository they cloned).

Peter Mortensen
claf
  • 14
    And, separate repos won't make the checkout time any shorter, since you still have to check out both repos! – Emil Sit Feb 12 '09 at 14:34
  • @EmilSit separate repo could make the checkout far shorter if you steadily clean the history of the "binary repo". Moreover devs would not be forced to checkout both repos *each time*. – FabienAndre Oct 16 '13 at 17:01
  • Why not just have the main module's build script fetch the binary files from the second repo, extracting them one-by-one (like here: http://stackoverflow.com/questions/1125476/git-retrieve-a-single-file-from-a-repository). – akauppi Feb 06 '14 at 08:54
  • 1
    Even if your binary files aren't changed frequently, large files can still kill your workflow if you often push branches to the repository for collaboration purposes. – Timo Reimann Sep 12 '14 at 09:17
11

SVN seems to handle binary deltas more efficiently than Git.

I had to decide on a versioning system for documentation (JPEG files, PDF files, and .odt files). I just tested adding a JPEG file and rotating it 90 degrees four times (to check the effectiveness of binary deltas). Git's repository grew by 400%. SVN's repository grew by only 11%.

So it looks like SVN is much more efficient with binary files.

So my choice is Git for source code and SVN for binary files like documentation.
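For anyone who wants to repeat that kind of measurement on the Git side, a rough sketch (file names are placeholders, and the rotation is done with whatever image tool you have at hand):

git init delta-test && cd delta-test
cp /somewhere/photo.jpg . && git add photo.jpg && git commit -m "original"
# rotate photo.jpg externally, then:
git add photo.jpg && git commit -m "rotated 90 degrees"
git gc                     # repack so delta compression is actually applied
git count-objects -vH      # size of the packed repository
du -sh .git                # total on-disk size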

Peter Mortensen
Tony Diep
  • 34
    You just needed to run "git gc" (repacking and garbage collecting) after adding those 4 files. Git doesn't immediately compress all the added content, so that you will have a group-of-files compression (which is more efficient in terms of size) and won't have a slowdown of separately compressing every single added object out there. But even without "git gc", git would have done the compression for you eventually, anyway (after it noticed, that enough unpacked objects have accumulated). – nightingale Oct 04 '10 at 08:13
  • It sounds like git must always push entire files though instead of deltas? If so that would be one other advantage for SVN when working with large files in that SVN supposedly only transfers the binary deltas in a commit. – jpierson Dec 30 '10 at 06:14
  • 2
    @jpierson Are you conflating "push" and "commit"? In SVN, there's no difference, but a git "commit" is a local operation. Computing the deltas immediately would just add overhead. – Aaron Novstrup Jul 12 '11 at 02:45
  • @Aaron - I have very little experience with git but my understanding is that push in git would be the best equivalent to commit in SVN when in a more traditional centralized setup. And in this case my original comment was that the push in git as I understand it would send up the whole file each push. This means if I only changed one pixel on a 100MB image then I will still need to send each byte in the 100MB file over to the source repository in the git push. My understanding is that SVN is much more efficient in this type of scenario. – jpierson Jul 12 '11 at 03:04
  • 24
    @jpierson I created an empty git repository and added (and committed) an entirely white bmp image with a size of 41MB, this resulted in a total git repository with a size of 328KB. After a `git gc` the total git repository size was reduced to 184KB. Then I changed a single pixel from white to black and committed this change, the total git repository size increased to 388KB, and after a `git gc` the size of the total git repository was reduced to 184KB. This shows that git is pretty good in compressing and finding deltas of binary files. – Tader Aug 01 '11 at 12:22
  • @Tader - But when doing a push does that whole content go over the wire? – jpierson Aug 01 '11 at 16:20
  • 4
    @jpierson A push only transmits the differences. So, the first push would transmit all the data (compressed). Subsequent pushes will transmit only changes. – Tader Aug 09 '11 at 16:33
  • @Tader - Cool, thanks for your comments Tader. If these facts hold true then Git is truly impressive in terms of how efficiently it works with binary files or content in general. – jpierson Aug 10 '11 at 16:56
  • 6
    @jpierson A sidenote: I just commented on the binary deltas. Git will eat all your memory and swap if it is managing repositories with large (GB size) files. For this, use [git-annex](http://git-annex.branchable.com/) (already mentioned in an other answer)... – Tader Aug 18 '11 at 18:27
  • 1
    SVN doesn't handle, it just can do dir-based checkouts. Having a programmer fetching a Texture builder's Gigabyte data is stupid... – MeaCulpa Feb 21 '12 at 09:30
  • @Tader this memory use has improved in 1.7.9. From the release notes: _As another step to support large files better, "git add" stores large files directly into a new packfile without having to hold everything in-core at once._ I think there are plans to improve this further, but I'm not sure what will be targeted: repack performance? – Tobu Feb 21 '12 at 12:31
  • I can't believe no one has mentioned SVN's treatment of banches and tags. In Git or CVS a tag requires nearly zero storage. In subversion, a tag or branch is a copy so adding the first tag doubles the repo size. For that reason SVN would be a very poor choice for binary files. -- comment by Robert Boehne – John Dvorak Jan 23 '13 at 17:41
  • 12
    @JanDvorak - no-one has mentioned it, because it's completely untrue. Subversion Copies are cheap - http://svnbook.red-bean.com/en/1.7/svn.branchmerge.using.html - about the middle of the page. – Joris Timmermans Feb 11 '13 at 15:23
  • 13
    @Tader: your test is bad. What you call a binary file is in fact (from the perspective of git) more like a text file - the bitstream is byte-aligned, and there are meaningful, localized diffs to be made; after all, changing one pixel is basically equivalent to changing one character in a text file (and who uses uncompressed bitmaps nowadays?) Try the same experiment with a small video, compressed image, virtual machine, zipfile or whatever - and you'll find that git doesn't deal efficiently with the delta; indeed it's fundamentally impossible with incompressible data. – Eamon Nerbonne Dec 05 '13 at 23:45
  • @EamonNerbonne: Mostly right, but if the change to the video or compressed image is in some sense small, then there exists a small representation (compression) of it -- the problem is that *automatically* discovering these representations is very difficult. This could be mitigated if git exposes compression mechanisms for end-user tinkering... Do you know if it does? – j_random_hacker Oct 23 '15 at 13:05
6

git clone --filter from Git 2.19 + shallow clones

This new option might eventually become the final solution to the binary file problem, if the Git and GitHub devs make it user-friendly enough (which they arguably still haven't achieved for submodules, for example).

It allows you to actually fetch only the files and directories that you want from the server, and was introduced together with a remote protocol extension.

With this, we could first do a shallow clone, and then automate which blobs to fetch with the build system for each type of build.

There is even already a --filter=blob:limit=<size> which allows limiting the maximum blob size to fetch.

I have provided a minimal detailed example of what the feature looks like at: How do I clone a subdirectory only of a Git repository?
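A quick sketch of how a blobless partial clone can be combined with a sparse checkout (the URL and directories are placeholders; this assumes Git 2.25+ for git sparse-checkout and a server that supports partial clone):

git clone --filter=blob:none --no-checkout https://example.com/big-repo.git
cd big-repo
git sparse-checkout set src docs   # only these directories will be populated
git checkout master                # the needed blobs are fetched on demand here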

Ciro Santilli OurBigBook.com
  • Nearly four years later, `git clone --filter` is supported by both Gitlab and Github but there's no utility to manage the endless growth of history that's downloaded on-demand to your working copy. Furthermore, native git is still nowhere close to being an alternative to Perforce which can manage petabytes of data: Github limits repos to 100 GB and Azure DevOps to 250 GB (at which point you start getting pack errors). Commonly-given advice implies that beyond 10 GB server performance starts to drop. Not even close to what would be needed for video game development. – Gabriel Morin Oct 20 '22 at 00:23
  • @GabrielMorin about history growth, what would you like to see as a feature beyond `git clone --depth 1`? – Ciro Santilli OurBigBook.com Oct 20 '22 at 05:46
  • @ciro-santilli-ourbigbook-com I want `--filter=blob:limit` to be constantly reapplied by a cleanup mechanism so blobs that got downloaded on-demand but that I haven't checked-out in a configurable amount of time get converted again into promisor blobs. I want to have the full history of the project locally as far as source files are concerned, while excluding all the game asset blobs which can range from a few MB to a few GB. This is completely different from what shallow clones i.e. `--depth 1` give you. Git would also have to be cool with unlimited size files. – Gabriel Morin Oct 20 '22 at 18:34
  • For large game projects, just the current revision of files can amount to 500 GB+, so you can't afford to have more than the latest copy in your .git folder. But partial clones never get rid of those blobs once they have downloaded them on-demand. If native git is going to compete with Perforce or PlasticSCM, it needs to solve 4 problems: 1. don't accumulate unwanted local history in partial clones 2. don't slow down as soon as big binary files are in the repo 3. don't fail with pack errors when overall repo size reaches 250 GB 4. let us offload the largest binaries to cheaper storage – Gabriel Morin Oct 20 '22 at 18:43
  • @GabrielMorin I see. `git gc` comes to mind, but not sure if it can do exactly what you want. That's where I'd start looking. Maybe the request would be for a `git gc --filter` option. – Ciro Santilli OurBigBook.com Oct 20 '22 at 19:59
2

I am looking for opinions on how to handle large binary files on which my source code (web application) is dependent. What are your experiences/thoughts regarding this?

I personally have run into synchronisation failures with Git with some of my cloud hosts once my web application's binary data notched above the 3 GB mark. I considered BFG Repo-Cleaner at the time, but it felt like a hack. Since then I've begun to just keep files outside of Git's purview, instead leveraging purpose-built tools such as Amazon S3 for managing files, versioning and backup.

Does anybody have experience with multiple Git repositories and managing them in one project?

Yes. Hugo themes are primarily managed this way. It's a little kludgy, but it gets the job done.


My suggestion is to choose the right tool for the job. If it's for a company and you're managing your codeline on GitHub, pay the money and use Git LFS. Otherwise you could explore more creative options such as decentralized, encrypted file storage using blockchain.

Additional options to consider include Minio and s3cmd.

Peter Mortensen
vhs
1

Git LFS is the answer

# Init LFS
git lfs install
git lfs track "large_file_pattern"
git add .gitattributes   # the tracking rule lives in .gitattributes and must be committed too

# Then follow the regular git workflow
git add large_file
git commit -m "Init a very large file"
git push origin HEAD

Behind the scenes, git lfs creates a reference to your large file and does not store it directly in the git repo.

For more info: https://git-lfs.github.com/

Dat
0

Have a look at camlistore. It is not really Git-based, but I find it more appropriate for what you have to do.

Peter Mortensen
Hernan