4

I'm searching for a tool that helps me analyze the disk space requirements of different files in a repository.

My repository contains some larger binaries with several revisions.

So, for example, I'm interested in how much space all the revisions of a single binary use in the repository. AFAIK this information is not easily available via the 'list' command, since I don't know how efficient svn's deltification is.

Or which files/folders use the most disk space (not only in the head revision, but across all revisions together)?

Any idea?

  • 1
    Your real solution would be to not store binaries in svn. – thekbb Feb 19 '13 at 17:04
  • Thanks for your comment, thekbb. My project contains not only source code but also larger test data, stored in *.xlsx or Matlab *.mat files. I'd like to use the advantages of version control for these files as well. – user2087749 Feb 20 '13 at 07:53
  • I'm aware that this might cause problems with disk space. But before considering different approaches I want to know how bad it is. That's where my question was coming from. – user2087749 Feb 20 '13 at 08:02
  • This doesn't help you at all now, but svn 1.8 has built in stats gathering that does what you're asking: http://subversion.apache.org/docs/release-notes/1.8.html#fsfs-stats – thekbb Mar 04 '13 at 13:58
  • Thanks for the hint to fsfs-stats, thekbb! – user2087749 Mar 05 '13 at 09:10

2 Answers

5

How much storage a node uses in Subversion is not as straightforward as it may seem. I'm going to talk about FSFS (and provide a hack of an answer for FSFS only) since that's almost certainly the filesystem implementation you're using. If you're using BDB things are a little different.

A node can use up storage in four ways: the actual text (body) of the node; its properties; the entry in the parent directory node that records its existence (directory nodes have a body consisting of a dictionary of their children and each child's representation); and finally the overhead of the filesystem (when you commit to a file, new representations of the directories bubble up to the root, so in my opinion that storage should be attributed to the files that caused it to be needed).

The space taken by the file text and properties is relatively easy to come up with; the directory storage and the overhead are much harder. Yet even for the relatively easy question of the file text, representation sharing makes it slightly complicated. Representation sharing happens when two files are identical (the names can be the same or different; the only thing that matters is that their text is the same): we avoid storing the text again.

The following one-liner should answer the file text question for a single file.

REPO=~/my-repo; FILE=/somebigfile; grep --recursive --no-filename --text --before-context 3 "cpath: $FILE" "$REPO/db/revs/"* | grep 'text:' | cut -d' ' -f 1-7 | sort -u | awk '{ DISK+=$4; if ($5 == 0) { FULL += $4 } else { FULL += $5 } } END { print DISK, FULL, FULL-DISK}'

You'll need to change REPO to be set to the path to your repository and FILE to be the absolute path inside the repository to the file you want. This may not work perfectly since I may have forgotten some detail or another. But let me walk through how this works.

It greps every revision file for the file you're looking for, asking for the preceding 3 lines as well as the matching line. Then it removes everything except the lines with text: on them (the lines detailing the text representation). We then exclude the last field (the uniquifier, which is used to distinguish between shared representations). This lets us limit the result to the unique representations we actually stored. We then sum the 5th and 4th fields (the full text size and the representation size, respectively, counting the leading text: as the first field). The full text size can be zero, which means it's the same as the representation size (we stored the full text, not a delta). Finally we print three values: the size we actually stored, the size of all versions of the file in full text, and the difference (a negative number means we were less efficient than storing plain text; a positive one means we saved that much space).

The fields of the text data are as follows:

revision offset_in_rev_file size_of_rep size_of_full_text md5 sha1 uniquifier

Older repositories may not have all of these fields; that's fine.
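To make the arithmetic concrete, here is a minimal sketch of just the summing step, run on made-up text: lines rather than a real repository. The function name sum_text_reps, the sizes, and the shortened hashes are all hypothetical; only the field layout comes from the description above.

```shell
# Sum on-disk and full-text sizes from FSFS "text:" lines read on stdin.
# Prints: <bytes stored> <bytes of full text> <bytes saved>
sum_text_reps() {
  # Drop the uniquifier (field 8), then keep each stored representation once.
  cut -d' ' -f 1-7 | sort -u |
    awk '{ disk += $4; full += ($5 == 0 ? $4 : $5) }
         END { print disk + 0, full + 0, full - disk }'
}

# Made-up sample (hypothetical values, hashes shortened for readability):
#   rev 1 stores 1000 bytes as full text (full text size 0),
#   rev 2 stores a 200-byte delta for a 1100-byte text,
#   the third line shares rev 2's representation under another uniquifier.
sample='text: 1 0 1000 0 aaaa 1111 1-1/_2
text: 2 50 200 1100 bbbb 2222 2-2/_3
text: 2 50 200 1100 bbbb 2222 3-3/_4'

printf '%s\n' "$sample" | sum_text_reps
```

On this sample it prints 1200 2100 900: 1200 bytes stored, 2100 bytes of full text, 900 bytes saved. The shared representation is counted once because its line becomes identical to the second one after the uniquifier is cut off.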

Because I'm depending on the text field being within 3 lines of the cpath field in the rev file (hey, this is a quick hack), it may not work perfectly. You may want to run the first two grep commands without all the rest and then look at the revisions provided (they'll be the first set of numbers from the left). Compare that with the output of svn log for the file. If all the revs are there then it should be accurate.
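That sanity check could be sketched roughly like this. The function name list_found_revs is hypothetical, and the demo runs against a throwaway directory that only mimics the db/revs layout with made-up content, so treat it as an illustration rather than a tested tool.

```shell
# Print the revision numbers of the "text:" lines found within 3 lines
# before a matching "cpath:" line, sorted and de-duplicated.
# $1 = repository path, $2 = absolute path of the file inside the repository
list_found_revs() {
  grep --recursive --no-filename --text --before-context 3 \
      "cpath: $2" "$1/db/revs/"* |
    grep 'text:' | cut -d' ' -f 2 | sort -un
}

# Demo on a fake layout (NOT a real FSFS repository; contents are made up).
demo_repo=$(mktemp -d)
trap 'rm -rf "$demo_repo"' EXIT
mkdir -p "$demo_repo/db/revs/0"
printf 'type: file\ntext: 1 0 1000 0 aaaa 1111 1-1/_2\ncpath: /somebigfile\n' \
  > "$demo_repo/db/revs/0/1"
printf 'type: file\ntext: 2 50 200 1100 bbbb 2222 2-2/_3\ncpath: /somebigfile\n' \
  > "$demo_repo/db/revs/0/2"

list_found_revs "$demo_repo" /somebigfile
```

Against a real repository you would pass your own REPO and FILE values and compare the printed revision numbers with the ones svn log reports for the same file. Any revision missing from the list means the 3-line window missed a text: line and the totals will be too low.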

If I find the time I'll try to write up a utility that does this the right way (using the SVN libraries) and that is more useful. It will probably include the storage used by properties and maybe some of the other storage I mentioned above.

TL;DR It's not an easy question to answer. Use the shell script above to answer the storage of a file text. It'll give you output that is the space we used on disk, the space of the full text of all revisions, and then how much we saved (negative means we lost space due to delta overhead).

Ben Reser
1

It is possible to dump a repository and filter out older unneeded versions of the binaries and then load the dump back to a repo of the same name.

What does your tooling / build look like?

Another thing to keep in mind: if you ever migrate to git or hg, each time you clone you pull down the entire history of those binary files, so disk space becomes an issue on the client as well.

thekbb
  • 2
    I've read in many places that SVN can do deltas on binaries, is this not true? e.g. http://stackoverflow.com/questions/538643/how-good-is-subversion-at-storing-lots-of-binary-files – James P Feb 20 '13 at 15:51
  • @JamesP correctly pointed out my mistake - svn does indeed store delta on binary. Thanks, man. – thekbb Feb 20 '13 at 21:20