
I have 3 terabytes of data: more than 300,000 reference files of all sizes (20, 30, 40, up to 200 MB each), which I back up regularly (not zipped). A few months ago I lost some files, probably due to data degradation (I "backed up" damaged files without noticing).

I do not care about security, so I do not need MD5, SHA, etc. I just want to be sure that the files I'm copying are good (the same bits and bytes) and to verify, a few months later, that the backups are still intact before backing up again.

My needs are basic: the files are not very important and there is no need for security (no sensitive information). My question: is the SFV/CRC32 format/method good and fast enough for my needs? Is there something better and faster? I'm using the program ExactFile.

Is there any checksum faster than SFV/CRC32 that is not flawed? I tried MD5, but it is slow, and since I do not need data security I preferred SFV/CRC32. Still, it's painful: there are more than 300,000 files and it takes hours to checksum all of them, even with an 8-core Xeon with Hyper-Threading and a fast HDD.

From the point of view of data integrity, is there any advantage in joining all the files into one .ZIP or .RAR instead of leaving them "loose" in folders and files?

Any tips?

Thanks!

Maldon

3 Answers


If you could quantify "few" and "some" in "A few months ago, I lost some files" (reading "few" as "every few" so that it becomes a rate), then you could calculate the probability of a false positive. Just from those words, though, I would say that yes, a 32-bit CRC should be fine for your application.

As for speed, if you have a recent Intel processor, you likely have a CRC-32C instruction, which can make the calculation much faster, by about a factor of 15. (See this answer for some code.) That could be made faster still by running it over multiple cores. If done right, you should be limited by the I/O, not the calculation.
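
As a rough illustration of spreading the per-file work across cores (this is not the code linked above; just a minimal sketch using Python's standard zlib.crc32, i.e. plain CRC-32 rather than the hardware-accelerated CRC-32C):

    import os
    import zlib
    from concurrent.futures import ProcessPoolExecutor

    def crc32_of_file(path, chunk_size=1 << 20):
        """CRC-32 of a single file, read in 1 MiB chunks to keep memory use flat."""
        crc = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                crc = zlib.crc32(chunk, crc)   # carry the running CRC across chunks
        return path, crc

    def crc32_tree(root):
        """Checksum every file under 'root', spreading the files over CPU cores."""
        paths = [os.path.join(d, name)
                 for d, _, names in os.walk(root) for name in names]
        with ProcessPoolExecutor() as pool:    # one worker process per core by default
            return dict(pool.map(crc32_of_file, paths))

On a spinning disk the pool mostly just keeps the drive busy, which is the point above: the I/O, not the CRC, sets the pace. (On Windows the pool has to be started from under an if __name__ == "__main__": guard.)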

There is no advantage in this case to bundling them in a zip or rar. In fact it may be worse, if a corruption of that one file causes you to lose everything.

Mark Adler
  • Mark Adler, thank you for the clarification. I have files here from as far back as 1997 and I have been copying them from HDD to HDD, so I always want to use a checksum to verify that everything is OK. To this day I have never had big losses (only a few corrupted files), but I get more paranoid about backups every day. One thing I quickly learned is never to compress the files. Regarding the "false positive": does it mean that even if the checksum is correct some files may be corrupted? Again, thanks for the clarification. – Maldon Dec 27 '15 at 01:10
  • Yes, a false positive would be a case where there are just the right errors in a file to return its CRC to the original value. If a file is corrupted, the probability of that happening by chance in that one case is very small, about 2^(-32). Since the number of files being corrupted in your case appears to be very small, that probability should be acceptable. – Mark Adler Dec 27 '15 at 03:36

If you aren't getting a throughput of at least 250 MB per second per core, then you're probably I/O- or memory-speed bound. The raw hashing speed of CRC32 and MD5 is higher than that, even on decades-old hardware, assuming a non-sucky, reasonably optimised implementation.

Have a look at the Crypto++ benchmark, which includes a wealth of other hash algorithms as well.

The Castagnoli CRC32 can be faster than standard CRC32 or MD5 because newer CPUs have a special instruction for it; with that instruction and oodles of supporting code (for hashing three streams in parallel, stitching together the partial results with a bit of linear algebra, and so on) you can speed up the hashing to about 1 cycle/dword. AES-based hashes are also lightning fast on recent CPUs, thanks to the special AES instructions.

However, in the end it doesn't matter how fast the hash function is if it spends its time waiting for data to be read; especially on a multicore machine you're almost always I/O bound in applications like this, unless you're getting sabotaged by small caches and the latencies of deep memory cache hierarchies.
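
A quick way to see whether the hash or the disk is the limit is to time the hash on a buffer that is already in memory (a rough Python sketch; the absolute numbers depend entirely on the machine):

    import hashlib
    import os
    import time
    import zlib

    buf = os.urandom(256 * 1024 * 1024)   # 256 MiB of random data, already in RAM

    for name, fn in [("crc32", lambda b: zlib.crc32(b)),
                     ("md5", lambda b: hashlib.md5(b).digest())]:
        start = time.perf_counter()
        fn(buf)
        elapsed = time.perf_counter() - start
        print(f"{name}: {len(buf) / elapsed / 1e6:.0f} MB/s")

If those in-memory figures are far above what you see when hashing files on disk, the disk (plus antivirus and lots of small reads) is the bottleneck, not the hash.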

I'd stick with MD5, which is no slower than CRC32 and universally available, even on the oldest of machines, in pretty much every programming system/language ever invented. Don't think of it as a 'cryptographically secure hash' (which it isn't, not any more) but as a kind of CRC128 that's just as fast as CRC32 yet requires some 2^64 hashings before a collision becomes likely, instead of only a few tens of thousands in the case of CRC32.
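
A minimal sketch of that approach with Python's standard hashlib (md5deep and friends do the same thing, just faster and recursively; the file name is made up):

    import hashlib

    def md5_of_file(path, chunk_size=1 << 20):
        """Hash a file with MD5, reading it in 1 MiB chunks so memory use stays flat."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(md5_of_file("some_reference_file.bin"))   # hypothetical file name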

If you want to roll some custom code then CRCs do have some merit: the CRC of a file can be computed by combining the CRCs of its sub-blocks with a bit of linear algebra. With general hashes like MD5 that's not possible (but you can always process multiple files in parallel instead).

There are oodles of ready-made programs for computing MD5 hashes for files and directories fast. I'd recommend the 'deep' versions of md5sum and its cousins: md5deep and hashdeep, which you can find on SourceForge and on GitHub.
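
For the check-again-months-later workflow, the digests have to be written out once and compared against a fresh pass later. A rough sketch of that round trip (not md5deep's own format; the manifest layout and helper names are made up for illustration):

    import hashlib
    import os

    def md5_of_file(path, chunk_size=1 << 20):
        """Same chunked MD5 helper as in the sketch above."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(root, manifest_path):
        """Record one 'digest  relative/path' line per file under root."""
        with open(manifest_path, "w", encoding="utf-8") as out:
            for d, _, names in os.walk(root):
                for name in names:
                    path = os.path.join(d, name)
                    out.write(f"{md5_of_file(path)}  {os.path.relpath(path, root)}\n")

    def verify_manifest(root, manifest_path):
        """Re-hash every listed file and return the ones that are missing or changed."""
        bad = []
        with open(manifest_path, encoding="utf-8") as f:
            for line in f:
                digest, rel = line.rstrip("\n").split("  ", 1)
                path = os.path.join(root, rel)
                if not os.path.isfile(path) or md5_of_file(path) != digest:
                    bad.append(rel)
        return bad

md5deep and hashdeep already cover this out of the box, so a custom script is only worth it if you want the manifest in your own format.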

DarthGizka

DarthGizka, thanks for the tips. I'm now using the 64-bit md5deep you pointed to. It's very good. I used to use ExactFile, which stopped being updated in 2010 and is still 32-bit only (no 64-bit version). I did a quick comparison between the two: ExactFile was faster at creating the MD5 digest, but at comparing the digests md5deep64 was much faster.

My problem is the HDD, as you said. For backup and storage I use three Seagate drives of 2 TB each (7200 rpm, 64 MB cache). With an SSD the procedure would be much faster, but with terabytes of files it is very hard to use an SSD.

A few days ago I ran the procedure on part of the files: 1 TB (about 170,000 files). ExactFile took about six hours to create the SFV/CRC32 digest. I used one of my newer machines, equipped with an i7 4770K (with the CRC32 instruction built in; 8 threads, four real cores plus four virtual), a Gigabyte Z87X-UD4H motherboard and 16 GB of RAM.

Throughout the calculation the CPU cores were almost idle (3% to 4%, maximum 20%). The HDD was at 100% utilisation, yet only a fraction of its rated speed was reached (SATA 3): most of the time 70 MB/s, sometimes dropping to 30 MB/s depending on the number of files being processed and on the antivirus running in the background (which I disabled later, as I often do when copying large numbers of files).

Now I am testing a copy program that does binary file comparison. In any case, I will continue using MD5 digests. I'm grateful for the information, and any tip is welcome.

Maldon