
I have a "large" (5 MB) text file in a git repo. If I add a character to the last line and run git add, my .git folder grows by approximately 1 MB (which I assume is the compressed size of my 5 MB file).

The same happens each time I edit and add.

If I run git add -p file I get a nice diff back of just a few bytes. But the large object file still gets stored when I complete the add.

Running git gc --prune=now removes the large object files, and things still seem to work as expected.

But regularly running git gc after each add is not a good option, since I use git in an automated way on an SD card, and writing and deleting megabytes like that will wear out the card.

So, my questions are:

1) Am I right that this is the behavior of git, or do I misunderstand something?

2) Can I avoid this and make git only save the diff?

I have no problem trading away flexibility in restoring old changes and so on. There is no need for branching or stashing or other things that can complicate life for git.

Edit: Just to be clear, my problem isn't that git saves the whole file once, but that it stores the whole file for each edit. If I add 10 characters, with an add and commit after each one, it saves the whole file (in compressed form) 10 times.

Nicklas Avén
  • Might be useful: http://blog.deveo.com/storing-large-binary-files-in-git-repositories/ – Swift - Friday Pie Jan 05 '17 at 10:57
  • Thanks, I have seen that link, but it seems to be more about handling files that are so big that their size is the problem, or the problem that deleted files still take up space in git.index. But those things are acceptable for me. – Nicklas Avén Jan 05 '17 at 11:12

3 Answers

9

Git stores all files as "objects" (specifically, as blob objects, with blobs being one of the four possible object types in Git). But this is not the whole story.

Each object is uniquely identified by its contents. The contents of the object are turned into a cryptographic hash (specifically SHA-1, computed over a header followed by the actual object bytes; the header consists of the object type, in this case blob, a space, the decimal representation of the size in bytes, and a single ASCII NUL byte). Hence if you add the exact same file more than once, you get the same hash, because the raw contents remain the same. But if you change even a single byte, you get a new object, with a new and different hash.
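As a rough illustration of the hashing described above, here is a minimal Python sketch that computes a blob ID the same way. The function name git_blob_id is made up for this example, and it assumes a repository using SHA-1 object names.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Header: object type, a space, the size in decimal, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = b"some text\n" * 1000
edited = original + b"x"  # one extra byte at the end

# Identical content always yields the same ID; any change yields a new one.
print(git_blob_id(original) == git_blob_id(original))
print(git_blob_id(original) == git_blob_id(edited))
```

The result should match what git hash-object prints for a file with the same bytes.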

This is why your repository grows by ~1 MB: as you surmised, 1 MB is the size of the compressed 5 MB object. One byte is different, so the new object has a new ID and is stored as a new "loose" object. A loose object consists of the compressed object and header, stored in its own separate file ... but not all objects are loose. Git also provides packed objects.
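To make the size effect concrete, the following sketch imitates what a loose object contains: the header plus content, compressed with zlib. It is an approximation for illustration only. The helper loose_object and the generated sample text are assumptions of this example, and Git's actual compression level and on-disk layout may differ.

```python
import hashlib
import random
import zlib
from typing import Tuple


def loose_object(content: bytes) -> Tuple[str, bytes]:
    """Return (object ID, zlib-compressed bytes) for a blob, roughly like a loose object."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


# Build a reproducible text-like sample of a bit over 1 MB (made-up data).
rng = random.Random(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon"]
original = b" ".join(rng.choice(words) for _ in range(200_000))
edited = original + b"x"  # append a single character

id_a, packed_a = loose_object(original)
id_b, packed_b = loose_object(edited)

# The IDs differ, and each version is stored as a complete compressed copy.
print(id_a != id_b)
print(len(original), len(packed_a), len(packed_b))
```

Both compressed copies come out at nearly the same size, which is the per-edit growth seen in the .git folder until the objects are packed.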

Packed objects are considerably more complicated. Objects stored in a pack are "deltified": compressed with Git's special modified variant of libXdiff (see also Is the git binary diff algorithm (delta storage) standardized?). Git chooses a base object and a series of derived objects that are then compressed against the base. With any luck, your files will be compressed against themselves, so that once they are packed, they go back to being relatively small, except for the base file itself.
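The idea of storing a derived object against a base can be sketched in a few lines of Python. This is deliberately much simpler than Git's real pack delta format: it only records the shared prefix and suffix lengths plus the differing middle. The names make_delta and apply_delta are invented for this illustration.

```python
from typing import Tuple

# (shared prefix length, shared suffix length, replacement bytes in between)
Delta = Tuple[int, int, bytes]


def make_delta(base: bytes, target: bytes) -> Delta:
    """Describe target as: keep a prefix of base, insert some bytes, keep a suffix of base."""
    limit = min(len(base), len(target))
    prefix = 0
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    # The suffix must not overlap the prefix in either buffer.
    while suffix < limit - prefix and base[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    return prefix, suffix, target[prefix:len(target) - suffix]


def apply_delta(base: bytes, delta: Delta) -> bytes:
    """Rebuild the target from the base object and the delta."""
    prefix, suffix, middle = delta
    return base[:prefix] + middle + base[len(base) - suffix:]


base = b"line of text\n" * 1000
target = base + b"x"  # one character added at the end

delta = make_delta(base, target)
print(len(target), len(delta[2]))  # full size vs. bytes stored in the delta
print(apply_delta(base, delta) == target)
```

With an edit like the one in the question, the delta holds a single byte while the base holds everything else, which is why packed history of a slowly changing file stays small.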

Git normally chooses when to make pack files on its own, and its usual code handles most ordinary source files pretty well. Very large text files will unbalance the automatic packing somewhat, so you might want to experiment with "hand packing" (using an occasional git repack -a -d and/or tweaking the window parameters) to see if you can get better results. However, note that except for "thin packs" used to send deltas across a network connection, pack files require the base object to be present in the same pack as all the deltified objects. If your large file will change often, it will be counterproductive to pack it often, as you will get many large packs (though the -a -d step should consolidate packs as long as you are not using "keep" files on them).

(If you modify the work-tree version of the file and git add the result, and it gets a new hash, Git will immediately package it up as a loose object, regardless of any existing packed versions.)

torek
  • +1 Thanks! This makes sense. Maybe git is not the right tool for what I need. Your information helped a lot to get an understanding of git objects and packed objects. – Nicklas Avén Jan 05 '17 at 12:59
  • It may not be, although 5 MB is small in the grand scheme of things these days. Still, this is the reason there are so many auxiliary systems for storing big files outside of Git (the two that come to mind being Git-LFS and git-annex). – torek Jan 05 '17 at 18:35
  • The problem is the total. There are many files involved and quite frequent adding new edits. As I understand things the answer to my 2 questions above is 1) Yes, 2) No. So git will be replaced in this hackish setting – Nicklas Avén Jan 06 '17 at 11:20
1

You can see the documentation here.

It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server. To see what happens, you can manually ask Git to pack up the objects by calling the git gc command:

So, don't worry about this: Git will automatically pack your files and keep only the differences to save disk space once there are too many loose objects. You can also run git gc manually.

ramwin
0

That's a common issue with all source control systems. They are meant to store code that they can parse as text. Anything that isn't text isn't stored differentially. Unrecognized files are simply uploaded. As someone who maintained several repositories at work, I had to deal with users who were able to increase the repository size by gigabytes by uploading a large file, then moving it or re-uploading it.

Swift - Friday Pie