
I know the original md5 algorithm produces a 128-bit hash.

Following Mark Adler's comments here, I'm interested in getting a good 64-bit hash. Is there a way to create an md5-based 64-bit hash using OpenSSL? (md5 looks good enough for my needs.) If not, is there another algorithm implemented in the OpenSSL library that can get this job done with quality no less than md5's (except for the length, of course)?

Subway
  • At least as much as md5. – Subway Mar 17 '13 at 10:01
  • I mean, what criteria do you need from your hash function? Speed? Cryptographic strength? A uniform distribution throughout the hash space? Low chance of collisions? Sensitivity to single-bit changes? – Thomas Mar 17 '13 at 10:03
  • I want the chances of collisions to be at least as low as md5 provides. – Subway Mar 17 '13 at 11:16
  • You can't have that if you use only 64 bits whereas MD5 uses 128. But it looks like there is already an answer below that might work for you. – Thomas Mar 17 '13 at 11:21
  • I understand that. But since the number of my database entries is not so high, I don't need such a long hash, as Adler pointed out in the link I mentioned in my post. So, indeed, I won't come up with a hash as strong as md5 because of its short length, but I want it to be algorithmically as strong as md5 with respect to collisions. Let's put it this way: I want it to prevent collisions on a small database as well as md5 does on databases 2^32 times bigger (if you go to the aforementioned link and search for the string "A good hash algorithm" you will see why I wrote 2^32 and not 2^64). – Subway Mar 17 '13 at 11:39

1 Answer


I claim that 'hash quality' is strongly related to the hash length. AFAIK, OpenSSL does not have 64-bit hash algos, so the first idea I had is simple and most probably worthless:

halfMD5 = md5.hiQuadWord ^ md5.lowQuadWord

Finally, I'd simply use an algorithm with appropriate output, like crc64.
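For reference, a bitwise (table-less) crc64 can be sketched as follows. The parameterization below is the reflected CRC-64/XZ variant; that choice is an assumption, since the answer doesn't name a specific crc64 flavor:

```cpp
#include <cstdint>
#include <cstddef>

// Bitwise CRC-64 in the reflected CRC-64/XZ parameterization
// (reflected ECMA-182 polynomial, init and final XOR of all ones).
uint64_t crc64_xz(const uint8_t* data, size_t len) {
    const uint64_t poly = 0xC96C5795D7870F42ULL;  // reflected ECMA-182 polynomial
    uint64_t crc = ~0ULL;                         // init = all ones
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int k = 0; k < 8; ++k)
            crc = (crc & 1) ? (crc >> 1) ^ poly : (crc >> 1);
    }
    return ~crc;                                  // final XOR = all ones
}
```

A production version would precompute a 256-entry lookup table instead of looping over single bits.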

There are several crc64 sources available online to verify against.
Edit

At first glance, Jenkins looks perfect; however, I'm trying to find a friendly C++ implementation for it without luck so far. BTW, I'm wondering, since this is such a good hash for databases' duplication checking, how come none of the common open-source libraries, like OpenSSL, provides an API for it? – Subway

This might simply be due to the fact that OpenSSL is a crypto library in the first place, using large hash values with appropriate cryptographic characteristics.

Hash algos for data structures have other primary goals, e.g. good distribution characteristics for hash tables, where small hash values are used as an index into a list of buckets containing zero, one, or multiple (colliding) elements.

So the point is whether, how, and where collisions are handled. In a typical DBMS, an index on a column will handle them itself.

The corresponding in-memory containers would be maps or sets.

A unique constraint would additionally prohibit insertion of equal field contents.


For example, say we have a table with file contents (plaintext, non-crypto application) and a checksum or hash value for mapping or consistency checks. We want to insert a new file. For that, we precompute the hash value or checksum and query for existing files with an equal hash value or checksum, respectively. If none exists, there won't be a collision and insertion is safe. If there are one or more existing records, there is a high probability of an exact match and a lower probability of a 'real' collision.

  • In case collision handling should be omitted, one could add a unique constraint to the hash column and reuse existing records, accepting the possibility of mismatching/colliding contents. Here, you'd want a database-friendly hash algo like Jenkins'.

  • In case collisions need to be handled, one could add a unique constraint to the plaintext column. Less database-friendly checksum algos like crc won't have an influence on collisions among records and can be chosen according to the types of corruption to be detected or other requirements. It is even possible to use the XOR'ed quad words of an md5 as mentioned at the beginning.
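The precompute-and-query flow described above can be sketched with a standard container standing in for the indexed hash column. The class name `FileStore` and the use of `std::hash` as the 64-bit hash are illustrative assumptions:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <functional>

// Sketch: std::unordered_multimap plays the role of the indexed hash column,
// std::hash plays the role of the 64-bit hash function.
class FileStore {
public:
    // Returns true if inserted, false if an identical file already exists.
    bool insert(const std::string& contents) {
        uint64_t h = std::hash<std::string>{}(contents);  // precompute the hash
        auto range = index_.equal_range(h);               // query records with equal hash
        for (auto it = range.first; it != range.second; ++it)
            if (it->second == contents)                   // exact match: reuse the record
                return false;
        index_.emplace(h, contents);                      // no match: safe to insert
        return true;
    }
private:
    std::unordered_multimap<uint64_t, std::string> index_;
};
```

The multimap allows colliding hash values to coexist, exactly like a non-unique index on the hash column; the exact-match comparison against the plaintext is what actually resolves collisions.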

Some other thoughts:

  • If an index/constraint on plaintext columns does the mapping, any hash value can be used to do reasonably fast lookups to find potential matches.
  • No one will stop you from adding both, a mapping friendly hash and a checksum.
  • Unique constraints will also add an index, which is basically like the hash tables mentioned above.

In short, it greatly depends on what exactly you want to achieve with a 64-bit hash algo.

Sam
  • Thanks for responding. You are absolutely correct, but what I need is an algorithm that prevents collisions for a small database as md5 does for databases 2^32 times bigger (If you go to the link I mentioned in my question and search for the string "A good hash algorithm" you will probably see what I mean). – Subway Mar 17 '13 at 11:51
  • I see. One of these, e.g. Jenkins, should work then: http://en.m.wikipedia.org/wiki/List_of_hash_functions#section_3 – Sam Mar 17 '13 at 12:25
  • At first glance, Jenkins looks perfect; however, I'm trying to find a friendly C++ implementation for it without luck so far. BTW, I'm wondering, since this is such a good hash for databases' duplication checking, how come none of the common open-source libraries, like OpenSSL, provides an API for it? – Subway Mar 17 '13 at 13:16
  • @Subway: See edit. Could you please add some details about what you need a 64-bit hash for? – Sam Mar 17 '13 at 15:33
  • This sheds light on the subject. Thanks! – Subway Mar 17 '13 at 16:09