Versioning large text files in git

Question

I've used git for a while for source control and I really like it. So I started investigating using git to store lots of large binary files, which I'm finding just isn't git's cup of tea. So how about large text files? It seems like git should handle those just fine, but I'm having problems with that too.

I'm testing this out using a 550mb mbox-style text file. I ran git init on a new repo to do this. Here are my results:

git add and git commit - total repo size is 306mb - repo contains one object that is 306mb in size
add one email to the mailbox file and git commit - total repo size is 611mb - repo contains two objects that are each 306mb in size
add one more email to the mailbox file and git commit - total repo size is 917mb - repo contains three objects that are each 306mb in size

So every commit adds a new copy of the mailbox file to the repo. Now I want to try to get the size of the repo down to something manageable. Here are my results:

git repack -adf - total repo size is 877mb - repo contains one pack file that is 876mb in size
git gc --aggressive - total repo size is 877mb - repo contains one pack file that is 876mb in size

I would expect to be able to get the repo down in size to something around 306mb, but I can't figure out how. Anything larger seems like a lot of duplicate data is being stored.
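One way to check whether a pack really stores later versions as deltas, rather than as full copies, is to inspect it with git verify-pack -v. The following is only a sketch on a small synthetic file: the file name, the sizes, and the committer identity are assumptions made for illustration, not the setup from the test above.

```shell
#!/bin/sh
# Sketch: count how many blobs in the pack are stored as deltas.
# The file name, sizes, and identity below are illustrative assumptions.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# About 3 MB of hex text stands in for the mailbox.
head -c 1000000 /dev/urandom | od -An -tx1 > mbox
git add mbox
git -c user.name=tester -c user.email=tester@example.com commit -q -m "initial mailbox"

# Append one small "email" and commit again.
printf 'From alice@example.com\nSubject: hello\n\nappended message\n' >> mbox
git add mbox
git -c user.name=tester -c user.email=tester@example.com commit -q -m "append one email"

git repack -a -d -f -q

# In verify-pack -v output, deltified objects carry two extra columns
# (delta depth and base object), giving 7 fields instead of 5.
deltas=$(git verify-pack -v .git/objects/pack/pack-*.idx |
    awk '$2 == "blob" && NF == 7 { n++ } END { print n + 0 }')
echo "blobs stored as deltas: $deltas"

# Summary of loose and packed object sizes (in KiB).
git count-objects -v
```

If the count stays at zero for a real repository after a repack, git is storing every version in full, which matches the symptom described above.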

My hope is that the repo would only increase by the size of the new email received, not by the size of the entire mailbox. I'm not trying to version control email here, but this seems to be the main thing holding me back from using a nightly script to incrementally back up users' home directories.

Any advice on how to keep the repo size from blowing up when appending a small amount of text to the end of a very large text file?

I've looked at bup and git annex, but I'd really like to stick with just plain old git if possible.

Thank you for your help!

I just tried this with 300MB and then 3MB of /dev/urandom, and the whole thing packed down to 302MB. — Josh Lee, Oct 31 '11 at 21:06
Check the mbox file isn't being compressed or encrypted by your mailer. Also, what version of git are you using? — Schwern, Oct 31 '11 at 21:09
Doesn't actually answer the question you're asking. But maybe use maildir instead? — Edward Thomson, Oct 31 '11 at 21:38
The mbox file is plain text, and git diff shows only new emails being appended to the end of the file. I was using git version 1.7.5.4 (I didn't realize I was that far behind) and have since upgraded to git version 1.7.7 and I'm still seeing the same behavior as I originally posted. I also tried the same procedure on a 64mb mbox file and git gc kept that repository at about 34mb even after a few new emails and commits. So it seems like the size of the mbox file is the issue (the original test file was 550mb). — user1020774, Nov 01 '11 at 20:27
I just found core.bigFileThreshold and set that to 1024mb, and now git gc is trying to delta compress where before it would very quickly pass that step. But now I'm getting an out of memory error. I'm trying some other config options to see if I can get past this. — user1020774, Nov 01 '11 at 20:30
I added pack.windowMemory to my config and set it to 256m. Running git gc now works fine, and my repository size is down under 300mb after more new emails and commits. So the solution looks to be to use the latest version of git and set core.bigFileThreshold and pack.windowMemory to appropriate values. — user1020774, Nov 01 '11 at 20:53
@user1020774 Would you post those comments as an answer for future reference? I'd never heard of those config options before. — Schwern, Nov 03 '14 at 18:13
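The settings described in the comments above can be sketched as follows. The values 1024m and 256m are the ones reported by the asker; the scratch repository is only there to make the snippet self-contained. For context, current Git documentation gives 512 MiB as the default for core.bigFileThreshold, which would be consistent with the 550mb file skipping delta compression while the 64mb file did not, though that link is an inference and was not confirmed in the thread.

```shell
#!/bin/sh
# Sketch: apply the two settings from the comments to a repository.
# A scratch repository is used here; in practice, run the two
# "git config" lines inside the repository holding the large file.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Let files up to 1024 MiB go through delta compression.
git config core.bigFileThreshold 1024m
# Cap the memory used by the delta search window during packing.
git config pack.windowMemory 256m

threshold=$(git config core.bigFileThreshold)
window=$(git config pack.windowMemory)
echo "core.bigFileThreshold=$threshold pack.windowMemory=$window"
```

After changing these, a fresh git gc (or git repack -a -d -f) is needed before the pack size reflects the new settings.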

score 5 · Answer 1 · answered Oct 31 '11 at 21:21

Git isn't the greatest backup tool, but it should be able to handle appending to a text file very efficiently. I was suspicious of your results. I repeated your experiment with a 354 meg file and git 1.7.7 on OS X. Here are my actions and the resulting size of .git.

git init (52K)
git add mbox && git commit (110M)
cat mail1 >> mbox && git commit -a -m (219M)
git gc (95M)
cat mail2 >> mbox && git commit -a -m (204M)
git gc (95M)

As you can see, git is being very efficient. 94 megs is the size of the compressed mbox. It can't get much smaller.
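The steps above can be approximated on a much smaller file with a script like the one below. It is a sketch under assumptions: the synthetic hex text, the message contents, and the committer identity are made up for illustration, and the sizes it prints will differ from the figures quoted in this answer.

```shell
#!/bin/sh
# Sketch: commit a text file, append to it twice, and compare the
# size of .git before and after "git gc". All names are illustrative.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# About 3 MB of poorly compressible text stands in for the mbox.
head -c 1000000 /dev/urandom | od -An -tx1 > mbox

commit() {
    git add mbox
    git -c user.name=tester -c user.email=tester@example.com commit -q -m "$1"
}

commit "initial mailbox"
printf 'From alice@example.com\nSubject: mail1\n\nfirst appended message\n' >> mbox
commit "append mail1"
printf 'From bob@example.com\nSubject: mail2\n\nsecond appended message\n' >> mbox
commit "append mail2"

# Three loose copies exist at this point, one per commit.
before_gc=$(du -sk .git | cut -f1)
git gc --quiet
after_gc=$(du -sk .git | cut -f1)
echo "before gc: ${before_gc}K, after gc: ${after_gc}K"
```

The size before git gc grows with every commit because each version is first written as a separate loose object; packing is what replaces the later versions with small deltas.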

I'm guessing you're either using an old version of git or your mbox file is being compressed or encrypted by your mailer.

Check that the contents of your mbox which git is seeing is plain text.
If you're not using the latest git, upgrade and try again.

score 3 · Answer 2 · edited Feb 22 '17 at 11:02

I don't think git will do a good job at storing deltas in general, and even if you can finagle it to do so, it won't be deterministic. That said, based on http://metalinguist.wordpress.com/2007/12/06/the-woes-of-git-gc-aggressive-and-how-git-deltas-work/, you may want to try git repack -a -d --depth=250 --window=250.

I suspect your best option is to truncate your history using git rebase, and only store the past few backups. You could do this using git branches. Make branches called yearly, weekly, and daily. Every day, commit to daily, then use git rebase --onto HEAD~4 HEAD~3 daily to delete backups more than 3 days old. On the first day of every week, check out weekly and git cherry-pick daily, then do the same git rebase to remove weekly backups more than 3 weeks old. Finally, on the first day of every year, follow a similar process. You will probably want to run git gc after each sequence to free up the old space.

But if you're doing this, you're no longer taking advantage of git, and you're abusing the way it works a fair amount. I think the best backup solution for you does not involve git.

I would be suspicious of any information about Git from 2007, especially the internals, especially the garbage collection. It was only two years old at the time. — Schwern, Nov 03 '14 at 18:12

VonC · Answer 3 · 2017-02-22T11:39:31.023

One of the side effects of large files is that git diff can run out of memory.

While Git isn't the right tool (as mentioned in the other answers), at least the git diff issue is mitigated in git 2.2.0 (Q4 2014).
See commit 6bf3b81 from Nguyễn Thái Ngọc Duy (pclouds):

`diff --stat`: mark any file larger than `core.bigfilethreshold` binary

Too large files may lead to failure to allocate memory.
If it happens here, it could impact quite a few commands that involve diff.
Moreover, too large files are inefficient to compare anyway (and most likely non-text), so mark them binary and skip looking at their content.
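The behavior described in that commit message can be observed on a small file by lowering core.bigFileThreshold artificially. This is a sketch that assumes Git 2.2.0 or later; the 1k threshold, the file name, and the committer identity are chosen only for the demonstration and are not recommended settings.

```shell
#!/bin/sh
# Sketch: a text file larger than core.bigFileThreshold is reported
# as binary by "git diff --stat". Assumes Git 2.2.0 or later.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Artificially tiny threshold, for the demonstration only.
git config core.bigFileThreshold 1k

# A few kilobytes of plain text, well above the 1 KiB threshold.
seq 1 1000 > big.txt
git add big.txt
git -c user.name=tester -c user.email=tester@example.com commit -q -m "add big.txt"

# Modify the file and look at the first line of the diffstat.
echo "one more line" >> big.txt
stat_line=$(git diff --stat -- big.txt | head -n 1)
echo "$stat_line"
```

The diffstat line should show a "Bin" size summary instead of a count of inserted lines, even though the file is plain text.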

score 1 · Answer 4 · answered Oct 31 '11 at 21:12

How much difference you see after packing the objects depends on the type of files, but git is not a backup tool and should not be used as one. Git's whole philosophy rests on the assumption that disk space is cheap, so it optimizes for the speed of operations instead. Also, whether a file is binary or text, git stores it the same way; as mentioned above, the type of file only determines how much difference you see after packing. Git distinguishes between binary and text files only for diff and similar purposes, not for storage.

Use an appropriate backup tool, which will also save you disk space. Something like ZFS for backups is worth trying out: https://svn.oss.prd/repos/SHAW/BuildAndReleaseTransition/TeamCity/TeamCityConfiguration-39/TeamCityConfiguration.docx
