4

Say I have a blob of text that is 5,000 characters long. I run it through a hashing program and it generates a 40-character hash. Now I run another blob of text, 10,000 characters long; it still generates a 40-character hash. That holds for text of any length.

My question: if the hashes are all unique, wouldn't I be able to compress anything into a 40-character string?
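
For illustration, this is roughly the behaviour being described, using SHA-1 as an example of such a hashing program (any fixed-output hash would do):

    using System;
    using System.Security.Cryptography;
    using System.Text;

    class HashLengthDemo
    {
        static string Sha1Hex(string text)
        {
            using (var sha1 = SHA1.Create())
            {
                byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(text));
                return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            }
        }

        static void Main()
        {
            string small = new string('a', 5000);   // 5,000-character blob
            string large = new string('b', 10000);  // 10,000-character blob

            // Both hashes are 40 hex characters, regardless of input length.
            Console.WriteLine(Sha1Hex(small).Length); // 40
            Console.WriteLine(Sha1Hex(large).Length); // 40
        }
    }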

chat

12 Answers

18

Hashing is not unique.

Hashing is a technique that attempts to generate a distinct hash for each value fed to it, but uniqueness is not guaranteed.

Good hashing algorithms produce duplicate hash values much less frequently than bad ones. Also, hashing is one-directional - meaning you can't go from a hash back to the original - so it's not meant for compression.

And: a hash doesn't need to be unique. The algorithm only needs to transform the same input into the same hash. You don't use a hash as an identifier!
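
A toy example (mine, not part of the original answer) to make that concrete - with a deliberately weak hash, collisions are easy to see, yet the same input always maps to the same value:

    using System;

    class TinyHashDemo
    {
        // A deliberately weak hash: sum of character codes modulo 256.
        static int TinyHash(string s)
        {
            int sum = 0;
            foreach (char c in s) sum += c;
            return sum % 256;
        }

        static void Main()
        {
            // The same input always produces the same hash...
            Console.WriteLine(TinyHash("abc"));  // 38
            Console.WriteLine(TinyHash("abc"));  // 38 again

            // ...but different inputs can produce it too ("acb" has the same character sum).
            Console.WriteLine(TinyHash("acb"));  // 38 - so the hash is not unique
        }
    }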

markus
Reed Copsey
9

Not all hashes are guaranteed to be unique. The Wikipedia entry on the topic is pretty good: http://en.wikipedia.org/wiki/Hash_function

pdwetz
  • To add to this: if you know your entire possible input set, you can generate a perfect hash for it. – GManNickG Apr 13 '09 at 22:31
8

One way to think of a hash is like a human fingerprint (hashes are also sometimes referred to as fingerprints).

You can "compress" any person into a (pretty much) unique fingerprint, but you cannot tell who someone is from their fingerprint alone. It's just like a hash: you can easily work out hash("abcdef") -> a1b2c3, but given only a1b2c3, you cannot trivially recover the source data.

To reverse a fingerprint, you need to compare it against a database of known people -> fingerprints (if the unknown fingerprint matches Person1, it belongs to them).

With a hash you must do much the same thing - keep a database of all string -> hash mappings (called a rainbow table), then look up the row with the hash "a1b2c3" and it shows that "abcdef" was hashed to get it. The other, more common way is simply to try every combination of characters, hash each one and compare (a brute-force attack).

Finally, while human fingerprints are "unique", it's possible for two to be the same; it's just incredibly unlikely. It's the same with hashing, and some hashing algorithms are more susceptible to collisions than others.

my question is if the hashes are all unique, wouldn't i be able to compress anything into a 40 char string?

Theoretically hashing would be a great compression method, but decompressing is utterly impractical beyond (say) 10 ASCII characters of data. You're right that you can "compress" anything into a 40-character string, but you cannot practically decompress it (and even theoretically it's a stretch, since many different inputs share each hash).
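
A rough sketch of both reversal approaches described above (a simple lookup table standing in for the rainbow table, plus a tiny brute-force loop; SHA-1 is used just for concreteness):

    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;

    class ReverseLookupDemo
    {
        static string Sha1Hex(string text)
        {
            using (var sha1 = SHA1.Create())
                return BitConverter.ToString(sha1.ComputeHash(Encoding.UTF8.GetBytes(text)))
                                   .Replace("-", "").ToLowerInvariant();
        }

        static void Main()
        {
            // Precompute hash -> original for a set of known inputs.
            var table = new Dictionary<string, string>();
            foreach (var word in new[] { "abcdef", "password", "letmein" })
                table[Sha1Hex(word)] = word;

            string unknownHash = Sha1Hex("abcdef");
            Console.WriteLine(table.TryGetValue(unknownHash, out var original)
                ? "Table lookup found: " + original
                : "Not in table");

            // Brute force: try candidate inputs until one hashes to the target.
            foreach (var guess in new[] { "aaaaaa", "zzzzzz", "abcdef" })
                if (Sha1Hex(guess) == unknownHash)
                    Console.WriteLine("Brute force found: " + guess);
        }
    }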

dbr
  • No, you can't use it as a compression system at all. Given a 40-character hash, there will be millions of (for example) 50-character strings for every hash value. It happens that it is very, very hard to find any of them, but you couldn't hope to find the "right" one. – Chris Jefferson Apr 15 '09 at 22:07
  • Git uses SHA-1 to generate fingerprints for each file and refers to them by hash. So what happens when 2 files return the same hash? – Bartłomiej Szałach Dec 27 '12 at 21:49
  • @bzxcv17 The chances of that happening are 1 in 2^52, which is highly unlikely. Look up SHA-1 hash collisions for more. – asheeshr Dec 28 '12 at 05:10
  • @bzxcv17 If two files have the same hash, they probably have the same content - so git just points both files to the same object. This is good, as it means the file's content is only stored once. The [Pro Git book](http://git-scm.com/book/en/Git-Internals-Git-Objects) has a great chapter explaining this. – dbr Dec 28 '12 at 05:12
  • @bzxcv17: I don't know how Git would deal with it, but it is very improbable. Assuming that hashes have an even distribution, the chance of getting a particular hash is 1 in 2^160, or 1 in 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976. That's very unlikely. See [this question](http://stackoverflow.com/questions/1867191/probability-of-sha1-collisions) for more information about the probability of collisions. – icktoofay Dec 28 '12 at 05:12
  • @icktoofay `echo blah > file1.txt; echo blah > file2.txt` and both will have the same hash... I don't think @bzxcv17 was asking about hash collisions, but rather how git deals with multiple files with the same content (which the [Pro Git link answers nicely](http://git-scm.com/book/en/Git-Internals-Git-Objects)). Git ignores the possibility of hash collisions, as far as I'm aware (there's no sane way to handle it, and as you say, it's incredibly unlikely - so a non-issue). – dbr Dec 28 '12 at 05:41
  • But it is theoretically possible that 2 different files return the same hash (I know it's highly unlikely, but possible). Does git have any safeguard for this behaviour? – Bartłomiej Szałach Dec 28 '12 at 10:57
  • @bzxcv17 [Nope](http://stackoverflow.com/questions/10434326/hash-collision-in-git), but the [implications have been considered](http://kerneltrap.org/mailarchive/git/2006/8/28/211065), and collisions are not a big problem (even if they were easier to cause). – dbr Dec 28 '12 at 11:09
5

Cryptographic hashes such as SHA-1 are not unique. There's an extremely tiny chance (on the order of 1 in 16^40, i.e. 2^160, for a 40-hex-character hash) that two different pieces of clear text will hash to the same value. For most applications the chance is considered small enough to ignore, since on average it would take millions of years to hit an accidental collision.

Yes - that Jake.
3

Hashing is about spreading values as evenly as possible, not about uniqueness!

Of course, full uniqueness would mean 100% spreading, but that's often not achievable, no matter how good your hashing algorithm is.

A striking example:

C#, for instance, provides an Int32 hash code for every object via GetHashCode() - including for Int64:

    Int64 a = Int64.MaxValue;        // 64 bits of data...
    Int32 myHash = a.GetHashCode();  // ...reduced to a 32-bit hash code

Conclusion: there are 2^64 different possible Int64 values, but only 2^32 hash codes for them!

So each hash value for an Int64 is shared, on average, with 4,294,967,295 other Int64 values!

So much for uniqueness hey :-)
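
A quick way to see this in practice (my sketch, not part of the original answer): keep drawing random Int64 values until two distinct ones share a 32-bit hash code. Thanks to the birthday effect, this usually happens after only tens of thousands of draws:

    using System;
    using System.Collections.Generic;

    class Int64HashCollisions
    {
        static void Main()
        {
            var byHash = new Dictionary<int, long>();
            var rng = new Random(1);
            byte[] buf = new byte[8];

            // With only 2^32 possible hash codes, a collision among random
            // 64-bit values is expected after roughly 2^16 draws
            // (assuming the hash codes are reasonably evenly spread).
            for (long draws = 1; ; draws++)
            {
                rng.NextBytes(buf);
                long value = BitConverter.ToInt64(buf, 0);
                int hash = value.GetHashCode();

                if (byHash.TryGetValue(hash, out long other) && other != value)
                {
                    Console.WriteLine($"{value} and {other} both hash to {hash} (after {draws} draws)");
                    return;
                }
                byHash[hash] = value;
            }
        }
    }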

Peter
1

Consider looking at this from the point of view of the pigeonhole principle: if you're stuffing n items into a smaller number of buckets k, some bucket must end up holding more than one item. So, to answer your question: no, hashes are not unique.
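
To put numbers on it for the scenario in the question (my arithmetic, assuming 1 byte per character and a 160-bit / 40-hex-character hash):

    using System;
    using System.Numerics;

    class PigeonholeDemo
    {
        static void Main()
        {
            // Possible 5,000-character inputs (1 byte per character) vs. possible 160-bit hashes.
            BigInteger inputs = BigInteger.Pow(2, 8 * 5000);
            BigInteger hashes = BigInteger.Pow(2, 160);

            // On average, each hash value ("bucket") must be shared by this many distinct inputs.
            BigInteger perBucket = inputs / hashes;

            Console.WriteLine("Inputs sharing each hash value: a number with "
                              + perBucket.ToString().Length + " digits");
        }
    }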

Richard Pistole
1

Hashing is not guaranteed to be unique, but if you are looking for a unique hash, look at gperf. It can generate a perfect (collision-free) hash function for a predetermined set of inputs.
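
gperf emits C code, but the idea is easy to sketch in any language: for a fixed, known key set you can hand-pick a function that sends each key to its own slot (a hypothetical hand-rolled example, not gperf's actual output):

    using System;

    class PerfectHashSketch
    {
        // The full key set is known in advance: "jan", "feb", "mar", "apr".
        // Their first characters are all distinct, so hashing on key[0]
        // alone gives every key its own slot with no collisions - a perfect hash.
        static int PerfectHash(string key)
        {
            switch (key[0])
            {
                case 'j': return 0; // jan
                case 'f': return 1; // feb
                case 'm': return 2; // mar
                default:  return 3; // apr
            }
        }

        static void Main()
        {
            foreach (var k in new[] { "jan", "feb", "mar", "apr" })
                Console.WriteLine(k + " -> " + PerfectHash(k));
        }
    }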

Zifre
0

I think this is a great explanation: http://www.cprogramming.com/tutorial/computersciencetheory/hash-table.html

Shen
0

You can condense any text into a hash (its signature), but you cannot reverse-calculate the text that produced that hash. Simply speaking, the only way to find out what text gave you a particular hash is to brute-force candidate texts through the hash function and look for a match.

See Wikipedia

stevehipwell
  • Again: only if the range of the hash is as big as the range of the input. Otherwise brute force would not give you a 100%-certain solution and, depending on the ranges, it could be far, far less certain. – Peter Apr 13 '09 at 22:41
  • Brute force will always give you the correct answer; it just may also give you false positives. A false positive with a well-defined hash is statistically highly improbable. – stevehipwell Apr 14 '09 at 08:00
0

Don't get confused by .NET's GetHashCode(). It's not very good for this purpose, as it's only 32 bits, compared to 640 bits in the original question (if each character is 8 bits).

stevehipwell
  • 1. I'm not feeling confused, thanks; it's exactly the same principle, and the blob in the example is also much bigger than the hash. 2. This is not an answer to the question at all! You should make a comment instead. – Peter Apr 13 '09 at 22:29
  • Sorry Peter, don't have the required reputation (50) to comment. But I'd recommend that you read Microsoft's comments on their GetHashCode() as "the default implementation of this method must not be used as a unique object identifier for hashing purposes". – stevehipwell Apr 13 '09 at 22:50
0

If you are using a well-defined hash function correctly, you can practically assume that hash results are unique.

The issue with your question is that a hash is a one-way function: there is no inverse function that takes a hash value back to the original blob, unless you keep a huge table of all possible original values (a so-called rainbow table).

Szere Dyeri
  • That's only true if the range of the hash is as big as the range of the original values; see the Int32/Int64 example. – Peter Apr 13 '09 at 22:32
  • Your answer is incorrect. The issue with the question is not the one-way function! – Peter Apr 13 '09 at 22:33
0

They are not unique, but with a high-quality algorithm such as SHA-1 you're much more likely to drop dead of a heart attack before you find two different documents with the same hash.

David Plumpton