SVN repo vastly bigger than the dumpfile?

Question

I've been put in charge of migrating our SVN installation from version 1.5.6 to 1.7.6. As part of that, I did a dump/load cycle of both our repositories and happened to notice something odd.

One of the repos "dumps" to a 2GB file, but after loading it takes up nearly 23GB of disk space. This was also the case in 1.5.6, but we were hoping the upgrade might help with that.

The repo in question is a little "odd" in that it contains a single folder with 7500 files (used to be up to 12000) and a subfolder with another 500 or so files, and that is it.

It would appear that it may be related to this issue: 350GB SVN repo creates atleast 1MB revision for even a simplest task like branch/tag

I am very much at a loss as to what we can do about this right now, but the repo is presently growing at a ridiculous pace and we will need to relocate it if we don't get this solved, a task I was hoping to avoid.

score 1 · Answer 1 · answered Sep 13 '12 at 07:11


First, SVN has two different repository backends: BDB (Berkeley DB) and FSFS (file system). How the repository is laid out on disk depends on this choice, with BDB typically being a bit larger. Which do you use?

If you use FSFS, then you should read up on sharding and packing. FSFS writes each revision to its own file, and every file occupies at least one filesystem block on disk (normally 2KB-16KB), however small the change. Multiply that overhead by the number of revisions and you can get very big numbers. The good news is that you can run a command to condense each completed shard into a single file:

svnadmin pack /path/to/repository

This might greatly improve your on-disk size.
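
One way to judge whether block overhead is really the culprit is to compare the apparent size of the revision files with the space they actually occupy, before and after packing. The following is only a rough sketch: the `slack_report` function name is made up for illustration, it assumes GNU coreutils `du` (for the `-b` and `-B1` options), and the `db/revs` location assumes an FSFS repository.

```shell
# Print file count, apparent size and allocated size (in bytes) for a directory.
# Assumption: GNU du, where -b means apparent size in bytes and
# -B1 reports allocated disk space in 1-byte units.
slack_report() {
    dir=$1
    files=$(find "$dir" -type f | wc -l | tr -d ' ')
    apparent=$(du -sb "$dir" | cut -f1)
    allocated=$(du -sB1 "$dir" | cut -f1)
    echo "files=$files apparent_bytes=$apparent allocated_bytes=$allocated"
}

# Hypothetical usage with the placeholder path from above:
#   slack_report /path/to/repository/db/revs
#   svnadmin pack /path/to/repository
#   slack_report /path/to/repository/db/revs
```

If the allocated size is far larger than the apparent size, packing should help. If the two figures are close, the space is going into the revision data itself and packing will change little.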

Or the space problem might be the massive-number-of-files-per-commit problem you mention.

In any case, you ask why the dump file is so much smaller than the repository. The dump file is a single file that essentially records every commit ever made to the repository, which makes it a very terse form of the repository (especially if --deltas is used). Since everything is placed into a single file, the per-file overhead is avoided.

I used to use and champion SVN in a previous organisation. Recently I moved to the Mercurial DVCS (also called Hg, and similar to Git). Once you have made the switch, it's difficult to think of ever going back. Anyway, here is a quote from Softpedia about repository size:

Disk space: When the Mozilla project was ported from SVN to Mercurial (very similar to Git in performance), disk space usage went down from 12GB to 420MB, 30 times smaller than the original size. Git is supposed to use the same storage algorithms, so file size should be around the same value.

You might want to investigate what would happen in your case if you switched to Hg or Git. If it is as dramatic as Softpedia's example, you could recommend Hg/Git to your management.

answered Sep 13 '12 at 07:11

Andrew Alcock


I'm going to try and run a pack command on the test-repo and let you know how it went. – Grubsnik Sep 13 '12 at 09:42
How's it going? The fact you are trying this implies you're using FSFS - is that right? – Andrew Alcock Sep 13 '12 at 10:43
Indeed it is, sadly the pack command didn't achieve the desired results. – Grubsnik Sep 13 '12 at 11:33
What's the size on disk on the original server? – Andrew Alcock Sep 13 '12 at 11:50
Same as before the move, ~23GB. Loading it into a 1.7 repo actually added about 50 megs to the total size. – Grubsnik Sep 13 '12 at 12:03
Hmmm. So the upgrade doesn't really change any disk requirement. What is the growth rate, eg 1GB/day? And your required max space on disk? With those two you can calculate the time before more radical actions have to be taken. The two courses of action are then migration to a JBOD storage (not too expensive) or maybe a migration to a DVCS? – Andrew Alcock Sep 13 '12 at 12:17
Moving the repo has always been an option, we were just trying to avoid having to do that. At present we have roughly 30 working days of disk space left. So it's not super critical, but we wouldn't mind getting rid of the problem altogether instead of just throwing hardware at it and ignoring the underlying cause. – Grubsnik Sep 13 '12 at 13:25
I'm sorry but I am out of ideas. http://www.svnforum.org/threads/39015-production-svn-move!-1-4-BDB-to-1-6-FSFS shows another person in much the same issue as you (with the same dump file/repository ratio), and there does not appear to be anything that SVN has to offer. I guess it's either add some extra disk (a few 100 GB is very cheap now) or change to another technology. Sorry I was not able to do more. – Andrew Alcock Sep 13 '12 at 14:08
Thanks for trying though. Diskspace is indeed cheap, but it would still be better for it to work properly. – Grubsnik Sep 14 '12 at 08:12
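
The estimate suggested in the comments (free space divided by growth rate) can be sketched as follows. The `days_left` helper and the figures passed to it are hypothetical, since the thread never states the actual growth rate.

```shell
# Estimate whole days until the disk is full.
# $1 = free space in GB, $2 = growth in GB per day (hypothetical inputs).
days_left() {
    awk -v free="$1" -v rate="$2" 'BEGIN {
        # A zero or negative growth rate gives no meaningful estimate.
        if (rate <= 0) { print "n/a"; exit }
        printf "%d\n", free / rate
    }'
}

# Example with made-up numbers: 60 GB free, growing by 2 GB per day.
days_left 60 2
```

Measuring the repository with `du` on two different days and dividing the difference by the days elapsed would give a real growth rate to plug in.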
