
As I understand it, Git stores a full copy of each file for every revision committed. Even though each copy is compressed, I don't see how that can compete with, say, storing compressed patches against a single full original revision. It's especially an issue with poorly compressible binary files such as images.

Is there a way to make git use a patch/diff based backend for storing revisions?

I get why Git does it this way for its main use case, but I have a particular use case where I would like to use Git and, as I understand it, it would take up too much space.

Thanks

Shovas
  • Note: If the files don't change, they aren't stored multiple times. – choroba Aug 18 '15 at 16:16
  • This may interest you: https://about.gitlab.com/2015/02/17/gitlab-annex-solves-the-problem-of-versioning-large-binaries-with-git/ – Andy Lester Aug 18 '15 at 16:23
  • In Git's typical use case, it is very efficient in storage, often better than diff-based schemes. If you have a different use case, you may need a different tool. Subversion, for instance, is good at managing large binary objects. What is the scenario you are imagining? – Wolf Aug 18 '15 at 16:36
  • @Andy, thanks for the GitLab link. Sadly, I really don't like their recommended usage. I don't want to add any extra steps to the process. I just want to keep it simple. – Shovas Aug 18 '15 at 17:47
  • @Wolf What I'd like to do right now is an incremental style database backup. In other cases I do have issues with storing larger-ish files like flash objects, images, videos, audio files, archives, etc. I wish there was a way to store those revisions as patches. – Shovas Aug 18 '15 at 17:49
  • I don't think there's a way to do this out-of-the-box. If you were really really in need of this, you could maybe write your own object-database backend for Git; but looking at the code, this is not exactly designed as a replaceable layer. – Wolf Aug 18 '15 at 17:54
  • "...there's no way that can compete with..." You might want to verify that assertion. There is quite a lot of evidence out there that indicates that in most cases, `git` storage is more compact than many of the other alternatives, once you've hit the threshold where things get packed into pack files... – twalberg Aug 18 '15 at 17:55

1 Answer


Git does use diff-based storage, silently and automatically, under the name "delta compression". It applies only to objects that are "packed", and packing doesn't happen after every operation.

  • git-repack docs:

    A pack is a collection of objects, individually compressed, with delta compression applied, stored in a single file, with an associated index file.

  • Git Internals - Packfiles:

    You have two nearly identical 22K objects on your disk. Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?

    It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.

    Later:

    The really nice thing about this is that it can be repacked at any time. Git will occasionally repack your database automatically, always trying to save more space, but you can also manually repack at any time by running git gc by hand.

  • "The woes of git gc --aggressive" (Dan Farina), which explains that delta compression is a byproduct of object storage and is not tied to revision history:

    Git does not use your standard per-file/per-commit forward and/or backward delta chains to derive files. Instead, it is legal to use any other stored version to derive another version. Contrast this to most version control systems where the only option is simply to compute the delta against the last version. The latter approach is so common probably because of a systematic tendency to couple the deltas to the revision history. In Git the development history is not in any way tied to these deltas (which are arranged to minimize space usage) and the history is instead imposed at a higher level of abstraction.

    Later, quoting Linus, about the tendency of git gc --aggressive to throw out old good deltas and replace them with worse ones:

    So the equivalent of "git gc --aggressive" - but done properly - is to do (overnight) something like

    git repack -a -d --depth=250 --window=250
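
To see the loose-to-packed transition described above for yourself, here is a minimal sketch. The throwaway repository, the file name `data.txt`, its generated contents, and the demo identity are all illustrative assumptions; `git count-objects -v`, `git gc`, and `git verify-pack -v` are standard Git commands. It counts loose objects before and after a manual `git gc`, then counts how many packed objects are stored as deltas.

```shell
#!/bin/sh
set -eu

# Throwaway repository in a temporary directory (illustrative only;
# delete "$repo" afterwards if you don't want to keep it).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
git config commit.gpgsign false
git config gc.auto 0   # disable automatic gc so only the manual run packs

# Commit ten revisions of a roughly 18 KB file that differ in one line.
i=1
while [ "$i" -le 10 ]; do
    awk 'BEGIN { for (n = 1; n <= 2000; n++) print "row", n }' > data.txt
    echo "revision $i" >> data.txt
    git add data.txt
    git commit -q -m "revision $i"
    i=$((i + 1))
done

# "count" is the number of loose objects; "in-pack" is the number packed.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
git gc -q
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

# In verify-pack -v output, deltified objects carry two extra columns
# (delta depth and base object), giving seven fields per line.
deltas=$(git verify-pack -v .git/objects/pack/pack-*.idx | awk 'NF == 7' | wc -l)

echo "loose objects before gc: $loose_before"
echo "loose objects after gc:  $loose_after"
echo "packed objects after gc: $packed_after"
echo "objects stored as deltas: $deltas"
```

Comparing `du -sh .git` before and after the `git gc` line gives a rough picture of the space saved; the exact numbers depend on your Git version and on how similar the revisions are.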
    
Jeff Bowman
  • Whoa. Mind blown. I can't believe I didn't know this about `git gc`. I tested it out with some commits, checking .git before and after, and you're right: it's keeping .git very small. Beautiful. Exactly what I wanted. And a great in-depth answer with insightful reading! Thank you! – Shovas Aug 18 '15 at 18:39
  • @Shovas You're welcome! Bear in mind that you will not usually have to trigger `git gc` yourself, as git will often follow common operations with a quick incremental gc on its own. Reserve manual `gc` and `repack` calls to when you add a whole lot of compressible files, or are stepping away and can let git process for a while. Enjoy! – Jeff Bowman Aug 18 '15 at 18:42
  • `git gc` also seems to delta-compress binary files, is that true? I've tried a gzip'd SQL dump and some JPG images and it seems to be able to save significant space with `git gc`. – Shovas Aug 18 '15 at 18:44
  • Yes, it works for binary files as well. If you're curious about the algorithm or format, see [this SO question](http://stackoverflow.com/q/9478023/1426891). – Jeff Bowman Aug 18 '15 at 18:47