
I want to calculate the MD5 and SHA checksums of a series of huge files. Each file is about 1GB, so I want this to be as fast as possible.

Could anyone recommend an efficient C++ library?

BTW, when reading a file with fread( buffer, sizeof(char), BUFFER_SIZE, fin ), what value of BUFFER_SIZE is reasonable?

buaagg
  • Reading in chunks that match the underlying allocation unit size and/or the buffer size the operating system uses typically gives the best performance. Something like 4096 bytes is typically optimal. – krisku Sep 28 '13 at 10:09
  • Buy a smoking-fast SSD and have at least 1GB of contiguous memory in your address space to mmap the file. Seriously. Until you've proven you're CPU or memory-bus-bound (i.e. you've *properly* benchmarked and found those areas wanting) any reasonable implementation will suffice. Lest you forget your Knuth. I'd lay a strong bet you'll be disk-io bound in nearly all situations, and unless you can pick up the pace by hitting the spin-spindle-faster button on a file that size, the lib you choose will likely make little difference, so long as the author of said same wasn't a quack. – WhozCraig Sep 28 '13 at 10:21
  • possible duplicate of [In C++, How to get MD5 hash of a file?](http://stackoverflow.com/questions/1220046/in-c-how-to-get-md5-hash-of-a-file) – Zaffy Sep 30 '13 at 18:23
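
On the BUFFER_SIZE question, the comments above suggest measuring rather than guessing. A minimal sketch of a chunked fread loop with a configurable buffer size follows; `read_in_chunks` is a hypothetical helper name, not part of any library, and the hashing step is left as a comment. Timing it with a few sizes (for example 4096 and 65536) on your own disk would show whether the size matters for you.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads the whole file with fread() in chunks of
// buffer_size bytes. Returns the number of bytes read, or -1 if the file
// cannot be opened or buffer_size is zero.
long long read_in_chunks(const char* path, std::size_t buffer_size) {
    assert(path != nullptr);
    if (buffer_size == 0) return -1;

    std::FILE* fin = std::fopen(path, "rb");
    if (!fin) return -1;

    std::vector<char> buffer(buffer_size);
    long long total = 0;
    std::size_t n;
    while ((n = std::fread(buffer.data(), sizeof(char), buffer.size(), fin)) > 0) {
        // A real program would pass buffer[0..n) to the hash update here.
        total += static_cast<long long>(n);
    }
    std::fclose(fin);
    return total;
}
```

Wrapping calls to this function in a timer gives a rough comparison, keeping in mind that the OS file cache can make the second run over the same file much faster than the first.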

3 Answers

Off the top of my head, I do not know of any fast C++ library. Computing a hash is relatively straightforward, so any C library will be just as easy to use (you can easily wrap it in a C++ class yourself). I found the following site, where the author implemented several hashing algorithms in x86 assembly and compared them to "official" C implementations of the same algorithms:

https://www.nayuki.io/page/fast-sha1-hash-implementation-in-x86-assembly
https://www.nayuki.io/page/fast-md5-hash-implementation-in-x86-assembly

Those implementations should be a good starting point; after that, you just have to make the file I/O as efficient as possible. Memory-mapped I/O is usually very efficient. Alternatively, you could go for a more complex design with two threads: one thread reads chunks from the file and the other hashes the data already read. The idea is to always keep the process doing something useful, i.e. hashes are calculated while waiting for more data to be read from the file.
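
As a rough illustration of the memory-mapped approach, here is a minimal POSIX sketch. It assumes a Unix-like system (open, fstat, mmap), and it uses 64-bit FNV-1a purely as a stand-in for the real digest so that the example stays self-contained; `fnv1a` and `hash_file_mmap` are hypothetical names, and a real program would call an MD5 or SHA routine on the mapped bytes instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Placeholder hash (64-bit FNV-1a), standing in for MD5/SHA in this sketch.
std::uint64_t fnv1a(const unsigned char* data, std::size_t len) {
    std::uint64_t state = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        state ^= data[i];
        state *= 1099511628211ULL;  // FNV prime
    }
    return state;
}

// Maps the file read-only and hashes it in place.
// Returns false if the file cannot be opened, stat'ed, or mapped.
bool hash_file_mmap(const char* path, std::uint64_t& out) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return false;
    }
    std::size_t size = static_cast<std::size_t>(st.st_size);
    if (size == 0) {  // mmap rejects a zero-length mapping
        close(fd);
        out = fnv1a(nullptr, 0);
        return true;
    }

    void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return false;

    out = fnv1a(static_cast<const unsigned char*>(p), size);
    munmap(p, size);
    return true;
}
```

Whether this beats a plain read loop depends on the system and the disk, so it is worth benchmarking both before committing to one.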

Nayuki
krisku

You could use OpenSSL. Search for Mysticial's answer about hashing a large file with MD5 under "How to create a md5 hash of a string in C?". If you look at the OpenSSL SHA docs, you will see that the MD5 and SHA functions are used in the same way.
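
The shared usage pattern is init, then update once per chunk, then finalize. In OpenSSL's legacy interface these are MD5_Init / MD5_Update / MD5_Final and SHA1_Init / SHA1_Update / SHA1_Final (deprecated since OpenSSL 3.0 in favour of the EVP interface). The sketch below shows only the shape of that loop: to stay self-contained it uses a stand-in hasher, `Fnv1a64`, which is not a cryptographic digest, and `feed_stream` and `hash_file` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Stand-in hasher with an init/update/finish shape. In real code the three
// steps would be the library's init, update and final calls instead.
struct Fnv1a64 {
    std::uint64_t state = 0;
    void init() { state = 14695981039346656037ULL; }  // FNV offset basis
    void update(const void* data, std::size_t len) {
        const unsigned char* bytes = static_cast<const unsigned char*>(data);
        for (std::size_t i = 0; i < len; ++i) {
            state ^= bytes[i];
            state *= 1099511628211ULL;  // FNV prime
        }
    }
    std::uint64_t finish() const { return state; }
};

// Reads the stream in buffer_size chunks and feeds each chunk to the hasher.
// The caller is responsible for init() before and finish() after.
template <typename Hasher>
void feed_stream(std::istream& in, Hasher& hasher, std::size_t buffer_size) {
    assert(buffer_size > 0);
    std::vector<char> buffer(buffer_size);
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        std::streamsize got = in.gcount();  // may be short on the last chunk
        if (got > 0) hasher.update(buffer.data(), static_cast<std::size_t>(got));
    }
}

// Hashes a whole file; returns false if the file cannot be opened.
bool hash_file(const std::string& path, std::size_t buffer_size, std::uint64_t& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    Fnv1a64 hasher;
    hasher.init();
    feed_stream(in, hasher, buffer_size);
    out = hasher.finish();
    return true;
}
```

Because the file is consumed chunk by chunk, memory use stays at one buffer regardless of file size, and MD5 and SHA can be computed in the same pass by updating two hashers per chunk.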

Community
MKAROL

I personally would do FILE *pipe = popen("md5sum filename", "r"); [or something to that effect]. It is likely to be as fast as anything else: reading a 1GB file takes a little while, and the calculation is unlikely to use much of your CPU time. Most of the time will be spent waiting for the disk to deliver the file.

On my system, I created 6 files of 1GB each, and it takes 2 seconds to checksum each file with md5sum (12 seconds for all 6 files).
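
A minimal sketch of the popen route, assuming a POSIX system where popen and pclose are available. `first_word_of_output` is a hypothetical helper; it relies on md5sum normally printing the digest as the first whitespace-delimited word of its output, followed by the file name.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Runs cmd through popen() and returns the first whitespace-delimited word
// of its standard output, or an empty string if the command cannot be
// started or prints nothing.
std::string first_word_of_output(const std::string& cmd) {
    assert(!cmd.empty());
    std::FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) return "";

    char word[256];
    std::string result;
    if (std::fscanf(pipe, "%255s", word) == 1) {
        result = word;
    }
    pclose(pipe);
    return result;
}
```

Usage would look like `first_word_of_output("md5sum filename")`. Since the command string goes through the shell, file names containing spaces or shell metacharacters need quoting or escaping before being placed in it.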

Mats Petersson