10

Our project contains binary files such as .doc, .xls, and .jpg, and we have chosen not to keep their past revisions (keeping just the latest version is OK). Is there a way to tell SVN, Git, Mercurial, or some other tool to skip the revisions for these files or for a particular folder?

Say there is a 4MB .doc file that I need to check in hundreds of times, but I don't really care about its past versions. If the system keeps 100 revisions of it, that's already 400MB, and checking in 300 times means 1.2GB for one file, which is not good. Only the latest version matters, so that everybody can sync to it. I also don't want other people who check out the project to have to check out 20GB of stuff. (Will Git and Mercurial keep all revisions in each person's local repository?)

nonopolarity
  • 9
    Sure, lots of them: they're called "filesystems". :-) – Ken Jun 14 '10 at 15:07
  • 3
    Why wouldn't you care about past revisions? Unless it's automatically generated files, and then there's an argument that those shouldn't be in version control anyway. – Chris K Jun 14 '10 at 18:57
  • 1
    But a filesystem won't let you sync removals, additions, or renames of files automatically. It also won't sync any structural changes to folders. – nonopolarity Jun 14 '10 at 19:09

10 Answers

17

Note that this is not quite an answer.

Setting aside the discussion about whether you should keep every version of the file for posterity, I will at least comment on one part of your question that might make you reconsider not keeping all the revisions of the file in the repository.

Version control systems typically don't store the entire file on each new revision; they store changes. Depending on the system, you might occasionally have a full copy of the file, but most of the changesets will contain changes only.

For instance, in Mercurial, I tried this: First I downloaded the C# 3.0 language specification as a Word file from this URL: http://download.microsoft.com/download/3/8/8/388e7205-bc10-4226-b2a8-75351c669b09/CSharp%20Language%20Specification.doc

Then I committed this to a fresh Mercurial repository. Size before the commit (empty repository) was 80 bytes, size of file on disk was 2.387.968 bytes, and repository after commit was 2.973.696 bytes. Note that the file is now effectively stored twice, once in my working copy (the one I can edit), and once in my repository as part of my initial commit.

Then I opened the file, changed all occurrences of 3.0 to 4.0 (without the quotes) and all occurrences of C# to VB, and saved. Then I committed the new version with a single-letter comment. The size of the repository after the commit is now 3.497.984 bytes. The difference is 512KB (there's some chunking involved in the repository, hence the size being an exact 512KB value).

If I now open up the file again, change only the title page VB back to C#, save, and commit again, the size of the repository grows by 276KB, up to 3.780.608 bytes.

As you can see, a change does not commit an entire copy of the file, but granted, the differences aren't in the "10KB" range either.

Let's assume that the average size of each diff, for this file alone, will be somewhere in between those two values, say halfway between them. This means that 300 commits of changes to this file, averaging 394KB each, total about 115MB. This is not a lot.
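The arithmetic behind that estimate can be reproduced with a small sketch. The byte counts are the ones reported above; the constant and function names are made up for illustration, and the "halfway between the two diffs" average is the same rough assumption as in the text, not a measurement.

```python
# Repository sizes in bytes, as reported in the experiment above.
SIZE_AFTER_FIRST_COMMIT = 2_973_696
SIZE_AFTER_SECOND_COMMIT = 3_497_984
SIZE_AFTER_THIRD_COMMIT = 3_780_608
FILE_SIZE = 2_387_968  # size of the .doc file on disk

KIB = 1024


def growth_kib(before: int, after: int) -> float:
    """Repository growth between two commits, in KiB."""
    return (after - before) / KIB


def projected_growth_mib(commits: int, avg_kib: float) -> float:
    """Projected repository growth in MiB for a number of commits."""
    return commits * avg_kib / KIB


# Growth caused by the large edit and by the small edit.
large_edit = growth_kib(SIZE_AFTER_FIRST_COMMIT, SIZE_AFTER_SECOND_COMMIT)
small_edit = growth_kib(SIZE_AFTER_SECOND_COMMIT, SIZE_AFTER_THIRD_COMMIT)

# Assumption from the text: a typical diff lands halfway between the two.
average = (large_edit + small_edit) / 2

# Hypothetical worst case for comparison: every commit stores a full copy.
full_copies_mib = 300 * FILE_SIZE / (KIB * KIB)

if __name__ == "__main__":
    print(f"large edit: {large_edit:.0f} KiB, small edit: {small_edit:.0f} KiB")
    print(f"average diff: {average:.0f} KiB")
    print(f"300 commits as diffs: {projected_growth_mib(300, average):.0f} MiB")
    print(f"300 commits as full copies: {full_copies_mib:.0f} MiB")
```

The last line is only there for contrast: it shows what 300 full copies of this particular file would take if nothing were stored as a diff.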

My suggestion is as follows:

  • Stop being cheapskates, disk space is cheap, compared to the headache you will have when someone says "I really wish I knew what that file looked like last week before you corrupted it".
Lasse V. Karlsen
  • 1
    at work, when hard disk space is seldom used up, I think it doesn't matter. If it is the home computer, I really don't want to waste 20GB, 30GB or 60GB as time goes by, on each of the home computers. If the computer has 300GB hard drive, I am wasting 10% of it just because of not caring about it. – nonopolarity Jul 14 '10 at 01:40
  • Also, Lasse was looking at a text file, but I am talking about binary files. – nonopolarity Jul 16 '10 at 02:43
  • 1
    Most VCSs store binary diffs only; a text file is just a binary file with an interpretation. – Lasse V. Karlsen Jul 16 '10 at 14:06
  • 1
    @動靜能量 Lasse was talking about a .doc file. A .doc file is NOT a text file. It is a proprietary binary format that happens to store textual information. – HardlyKnowEm Mar 28 '12 at 15:28
4

A quick check of hard drive prices puts 1 terabyte (TB) internal drives around $75 USD each. Using your math, that's 250,000 copies of your 4MB file, or $0.0003 per copy. Typical overhead for a programmer for an hour is around $100.

What costs more: keeping all of the versions of that file, or paying a programmer to recreate an older version if you ever need that copy again?
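As a rough sketch of that comparison, the figures quoted above can be plugged into a few lines of Python. The constant and function names are hypothetical, 1 TB is taken as 1,000,000 MB, and every revision is assumed to be stored as a full 4MB copy, which is the worst case.

```python
# Figures quoted in the answer above.
DRIVE_PRICE_USD = 75.0
DRIVE_CAPACITY_MB = 1_000_000  # 1 TB, using decimal megabytes
FILE_SIZE_MB = 4
PROGRAMMER_RATE_USD_PER_HOUR = 100.0

# How many full copies of the file fit on one drive, and what each one costs.
copies_per_drive = DRIVE_CAPACITY_MB // FILE_SIZE_MB
cost_per_copy = DRIVE_PRICE_USD / copies_per_drive


def storage_cost(revisions: int) -> float:
    """Disk cost in USD of keeping this many full copies of the file."""
    return revisions * cost_per_copy


def break_even_revisions(hours_to_recreate: float) -> float:
    """Number of stored copies that cost as much as recreating one version."""
    return hours_to_recreate * PROGRAMMER_RATE_USD_PER_HOUR / cost_per_copy


if __name__ == "__main__":
    print(f"copies per drive: {copies_per_drive}")
    print(f"cost per copy: ${cost_per_copy:.4f}")
    print(f"cost of 300 revisions: ${storage_cost(300):.2f}")
    print(f"copies equal to one programmer hour: {break_even_revisions(1):,.0f}")
```

Under these assumptions the disk cost only matches a single hour of programmer time after hundreds of thousands of stored copies.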

Craig Trader
  • 1
    I second your opinion, but: The main cost is not the hard drive but the (tape) backups. – ur. Jun 14 '10 at 15:01
  • 1
    That's even easier: back up to external hard drives. They're faster and more reliable than tape, and cheaper once you factor in the price of the tape changer and all of the media. – Craig Trader Jun 14 '10 at 17:15
  • Keep in mind that *THAT* $75 USD is for a "consumer" hard drive. If you are talking about "SAN" hard drives, I've heard them weigh in at about $1k USD per TB. (This info might be old, but you get the idea.) – Pharaun Jun 14 '10 at 19:14
  • Oh, enterprise drives do cost a bit more than consumer drives, but they don't cost 10 times as much. Even if they did, that's still only $0.003 to store a 4MB file. – Craig Trader Jun 14 '10 at 20:51
  • And even supposing something outlandish, like version control costing $5 PER MONTH in disk space, that's nothing at all compared to the time saved and the hassles avoided in the development process. Even if an intern paid $25 an hour is able to fix a mistake that would otherwise have taken half an hour of his time, you've already more than made back your investment. – dimo414 Jun 15 '10 at 01:01
  • 1
    Ironically, you emphasize the importance of saving programmer time but fail to calculate the cost in programmer time of waiting to copy gigabytes of unneeded versioned binaries to make an initial clone. Programmers tend to assume they can just "grab and go" a DVCS repository; half-hour download time is a rude surprise. – Ron Burk Aug 09 '16 at 20:32
4

I do know one that does this, but you're not going to like the answer.

It's Visual SourceSafe. Check the flag 'store only latest version' on a file and it stops keeping history.

If you want this feature with a decent SCM, I would recommend not putting the file in the SCM at all, but storing it elsewhere, such as in a document management solution or even just a filesystem share.

gbjbaanb
3

This is not a job for VCS, but for the filesystem, like Ken said.

However, if you really need such a 'feature', you may use the hooks mechanism to delete previous versions of the file (let's say, those older than 3 commits) from the history.

takeshin
2

Perforce can do it for you.

Check file types:

+S Only the head revision is stored. Older revisions are purged from the depot upon submission of new revisions. Useful for executable or .obj files.

-or-

+Sn Only the most recent n revisions are stored, where n is a number from 1 to 10, or 16, 32, 64, 128, 256, or 512. Older revisions are purged from the depot upon submission of more than n new revisions, or if you change an existing +Sn file's n to a number less than its current value. For details, see the Command Reference.
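To make the allowed values concrete, here is a minimal sketch that builds such a modifier string and rejects values of n outside the range quoted above. The helper names are hypothetical and not part of Perforce; only the `+S` / `+Sn` modifiers and the list of permitted values come from the text.

```python
from typing import Optional

# Values of n permitted for +Sn, per the description quoted above.
ALLOWED_LIMITS = frozenset(range(1, 11)) | {16, 32, 64, 128, 256, 512}


def storage_modifier(n: Optional[int] = None) -> str:
    """Return "+S" (head revision only) or "+Sn" (keep the last n revisions)."""
    if n is None:
        return "+S"
    if n not in ALLOWED_LIMITS:
        raise ValueError(f"+S{n} is not allowed; n must be one of {sorted(ALLOWED_LIMITS)}")
    return f"+S{n}"


def filetype(base: str, n: Optional[int] = None) -> str:
    """Combine a base file type such as "binary" with the storage modifier."""
    return base + storage_modifier(n)


if __name__ == "__main__":
    print(filetype("binary"))     # head revision only
    print(filetype("binary", 3))  # keep the three most recent revisions
```

The resulting string, for example `binary+S3`, is the kind of value you would pass as the file type when adding or reopening a file (for instance with the `-t` option of `p4 add`); check the Command Reference for the exact usage in your Perforce version.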

Vitaliy
2

For your specific need, where you can remove past versions whenever you want, a VCS (a Version Control System, made to never lose a version) is not well suited.

A repository manager (which is a more advanced solution than a simple shared path on a filesystem) is what you are looking for.
(E.g., Nexus Sonatype, to mention only one.)

VonC
1

The primary responsibility of a version control system is to keep a history of changes, so I don't think this is possible. Why use version control when you only want the latest version?

jfs
1

In general, no: a VCS is intended to keep the entire history. However, all is not lost on the space front; all the systems you named will store binary diffs for each revision, not a complete copy of the entire file. This means that the space required will often be much less.

Andrew Aylett
1

Why not use SVN for binary files and a DVCS for all source files? This way, you keep all revisions server-side but only one copy client-side. And for the other sources, you get the benefit of having a real VCS.

I understand that we want to keep all revisions of a binary file somewhere, but not pay the price on each "pull" every developer makes in every clone they have. That might be abusive.

Michel
0

If all you want is to sync files across computers, use Dropbox.

If you are using version control, then see what Lasse V. Karlsen wrote: disk space is cheap.

Jaanus