
Ok so here's the use case. I have lots of somewhat lengthy (200-500 character) strings that I'd like a smaller deterministic hash for. Since the full 160-bit SHA-1 value fits in a mere 20 bytes, this yields an order-of-magnitude space improvement per string.

But of course one has to worry about collisions when hashing strings, even with a crypto hash that has decent avalanche properties. I know the chances are infinitesimally small, but I'd like to be more conservative. If I do something like this:

hash(input) = CONCAT(HF1(input),HF2(input))

where HF1 is some suitably robust hash function and HF2 is another distinct but equally robust hash function, does this effectively make the chance of a collision near impossible (at the cost of 40 bytes now instead of 20)? NOTE: I am not concerned with the security/crypto implications of SHA-1 for my use case.
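A minimal sketch of the concatenation scheme in Python, using hashlib's SHA-1 and SHA-256 (truncated) as the two distinct hash functions — the particular pair is an illustrative assumption, any two independent hashes would do:

```python
import hashlib

def concat_hash(data: bytes) -> bytes:
    """CONCAT(HF1(input), HF2(input)) with HF1 = SHA-1, HF2 = SHA-256.

    SHA-256 is truncated to 20 bytes so the combined digest is the
    40 bytes mentioned in the question (20 + 20).
    """
    hf1 = hashlib.sha1(data).digest()          # 20 bytes
    hf2 = hashlib.sha256(data).digest()[:20]   # 20 bytes (truncated)
    return hf1 + hf2                           # 40 bytes total

digest = concat_hash(b"some 200-500 character string ...")
print(len(digest))  # 40
```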

CLARIFICATION: the original question asked about hashing the concatenated hash values, not about concatenating the hashes themselves. Hashing the concatenation DOES NOT change the collision probability, which is still bounded by the outer hash function.

omnisis
  • @Oli: what if I concatenate the two hashes (losing some space in the process)? Still doesn't help? – omnisis Jan 11 '13 at 03:30
  • @Matt: Edited original question to make it more clear. – omnisis Jan 11 '13 at 03:37
  • Considering SHA-x is designed to well-distribute "any" input .. I don't see what advantage the pre-hashes would result in. That is, **a change to a *single* bit in the SHA-x input will change the output significantly**. (If anything, I would imagine the pre-hashing would be *detrimental* if it could be exploited - on purpose or by accident - due to not being "as robust" as SHA-x or otherwise collapsing the input space.) –  Jan 11 '13 at 03:37
  • `CONCAT(SHA(x), SHA(f(x)))`, where `f` is some function that alters - but does not collapse - the input (e.g. reverse), will result in a larger output space, yes. Will this even matter? I don't know, but highly doubt it. –  Jan 11 '13 at 03:45
  • @pst: looking at the actual probabilities I am now convinced that the chance is so slim that it's silly to worry about. – omnisis Jan 11 '13 at 03:56

2 Answers

3

Assuming "reasonable" hash functions, then by concatenating, all you're doing is creating a hash function with a larger output space. So yes, this reduces the probability of collision.

But either way, it's probably not worth worrying about. 2^320 vastly exceeds even the number of particles in the observable universe (roughly 10^80). So you only need to worry if you're expecting attackers.
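To put numbers on "not worth worrying about": by the birthday bound, hashing n random inputs with a b-bit hash gives a collision probability of roughly n²/2^(b+1). A quick Python estimate (the workload of 10^9 strings is an assumption for illustration):

```python
# Approximate birthday-bound collision probability: p ~= n^2 / 2^(b+1).
def collision_prob(n_items: int, bits: int) -> float:
    return n_items * n_items / 2 ** (bits + 1)

n = 10**9  # a billion strings -- assumed workload, for illustration
for bits in (160, 320):
    print(bits, collision_prob(n, bits))
# Even at 10^9 items, a 160-bit hash gives p of about 3.4e-31;
# 320 bits is astronomically smaller still.
```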

Oliver Charlesworth
  • what if I concatenated two distinct hashes of the same string? Is it not true that the probability of two distinct hash functions yielding the same value for a given string is less than the probability of only one of them doing so? – omnisis Jan 11 '13 at 03:35
  • @omnisis: Assuming they're reasonable independent hashes, yes, that's true. But in your question, you're then passing this result into another hash function. – Oliver Charlesworth Jan 11 '13 at 03:36
  • @omnisis: Hmm, you've now completely changed the question. Obviously, a longer hash function is much less likely to see collisions than a shorter hash function (assuming they're both "reasonable"). – Oliver Charlesworth Jan 11 '13 at 03:40
  • Yeah sorry about that. One more... Should I care about using a crypto-style hash function for at least one of the hashes, or can I "get away" with two smaller functions? – omnisis Jan 11 '13 at 03:43
  • @omnisis: I don't know, but I doubt it matters! Either way, the chance of a collision is 1 in 2^160, which is astronomically small... – Oliver Charlesworth Jan 11 '13 at 03:45
  • If you post a new answer I'll give you credit... Again, sorry. – omnisis Jan 11 '13 at 03:46
  • IIRC the estimate of the number of particles in the universe is about 10^80, which is roughly 2^266. But otherwise this is great. – templatetypedef Jan 11 '13 at 03:51
0

I asked the wrong question initially. This was probably the question I was looking for:

Probability of SHA1 collisions

This was also illuminating:

Understanding sha-1 collision weakness

I guess it's fair to ask whether two hash functions whose concatenated size is smaller than 20 bytes — say, two distinct 32-bit hash functions — would produce a collision probability small enough to ignore in practice, since two (or even three) of those concatenated would still be smaller than a single SHA-1.
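As a rough check of that idea: concatenating two independent 32-bit hashes behaves at best like a single 64-bit hash, and whether that is "small enough to ignore" depends entirely on how many strings you store. A sketch using zlib's crc32 and adler32 as two cheap, distinct 32-bit functions (illustrative choices only — neither is a high-quality hash, and they are not truly independent):

```python
import struct
import zlib

def concat32(data: bytes) -> bytes:
    """Two distinct 32-bit checksums concatenated into an 8-byte value."""
    h1 = zlib.crc32(data) & 0xFFFFFFFF
    h2 = zlib.adler32(data) & 0xFFFFFFFF
    return struct.pack(">II", h1, h2)

# Birthday bound for the combined 64-bit space: p ~= n^2 / 2^65.
# At n = 10^6 strings: p ~= 10^12 / 3.7e19 ~= 2.7e-8 -- probably fine.
# At n = 10^9 strings: p ~= 10^18 / 3.7e19 ~= 0.027 -- no longer ignorable.
print(len(concat32(b"example")))  # 8
```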

omnisis