Are standard hash functions like MD5 or SHA1 quaranteed to be unique for small input (4 bytes)?

Question

Scenario:

I'm writing web service, that will act like identity provider for 3pty application. I have to send to this 3pty application some unique identifier of our user. In our database, unique user identifier is integer (4 bytes, 32 bites). Per our security rules I can't send those in plain form - so sending them out hashed (trough function like MD5 or SHA1) was my first idea.

Problem:

The result of MD5 is 16 bytes, result of SHA1 is 40 bytes, I know they can't be unique for larger input sets, but given the fact my input set is only 4 bytes long (smaller then hashed results) - are they guaranteed to be unique, or am I doomed to some poor-man hash function (like xoring the integer input with some number, shifting bites, adding predefined bites, etc.) ?

If this is an identity service, should'nt you be doing everything over TLS ? — Malkocoglu, Jul 26 '13 at 09:52
Does this even make sense? For such short input, finding a pre-image of a hash is trivial. Simply trying everything is very feasible. Using an obviously reversible function as you also propose is even more trivially reversible - by design even. This problem seems to call for encryption, not hashing. — harold, Jul 26 '13 at 09:54
@Malkocoglu - the connection itself is secure, the issue here is 3pty can't know our user ids, on the other hand I have to give them unique token, that will identify unique user on their end. — Ondrej Svejdar, Jul 26 '13 at 10:49
@harold - I have to pass to 3pty service some unique token (that has to stay the same for given user id), is encryption any better than hashing for such small input ? — Ondrej Svejdar, Jul 26 '13 at 11:01
Maybe, at least then the ability to see what the original ID was depends on access to the key instead of knowledge about the algorithm used (and honestly, if they see 16 or 20 bytes, the first thing anyone will try is assuming it's an MD5 or SHA1). But you might as well generate a new "external ID" for every user and use that, just put it in a DB somewhere. Could be a GUID or whatever. There is absolute no chance to reverse that to the original ID since it's simply not related. — harold, Jul 26 '13 at 11:20

score 1 · Accepted Answer · edited May 23 '17 at 10:27

For what you're trying to achieve (preventing a 3rd party from determining your user identifier), a straight MD5 or SHA1 hash is insufficient. 32 bits = about 4 billion values, it would take less than 2 hours for the 3rd party to brute force every value (@1m hashes/sec). I'd really suggest using HMAC-SHA1 instead.

As for collisions, this question has an extremely good answer on their likelihood. tl;dr For 32-bits of input, a collision is excessively small.

If your user identifiers aren't random (they increment by 1 or there is a known algorithm for creating them), then there's no reason you can't generate every hash to make sure that no collision will occur.

This will check the first 10,000,000 integers for a collision with HMAC-SHA1 (will take about 2 minutes to run):

public static bool checkCollisionHmacSha1(byte[] key){
    HMACSHA1 mac = new HMACSHA1(key);
    HashSet<byte[]> values = new HashSet<byte[]>();
    bool collision = false;
    for(int i = 0; i < 10000000 && collision == false; i++){
        byte[] value = BitConverter.GetBytes(i);
        collision = !values.Add(mac.ComputeHash(value));
        if (collision)
            break;
    }
    return collision;
}

Malkocoglu · Answer 2 · 2013-07-26T13:12:58.683

First, SHA1 is 20 bytes not 40 bytes.

Second, although input is very small, there still may be a collision. It is best to test this, but I do not know a feasible way to do that.

In order to prevent any potential collision:

1 - Hash your input and produce the 16/20 bytes of hash
2 - Spray your actual integer onto this hash. 
    Like put a byte of your int every 4/5 bytes.

    This will guarantee the uniqueness by using the input itself.

Also, take a look at Collision Column part

Are standard hash functions like MD5 or SHA1 quaranteed to be unique for small input (4 bytes)?

2 Answers2