168

I've got an image library on Amazon S3. For each image, I md5 the source URL on my server plus a timestamp to get a unique filename. Since S3 can't have subdirectories, I need to store all of these images in a single flat folder.

Do I need to worry about collisions in the MD5 hash value that gets produced?

Bonus: How many files could I have before I'd start seeing collisions in the hash value that MD5 produces?

MPelletier
  • 16,256
  • 15
  • 86
  • 137
Ben Throop
  • 4,772
  • 5
  • 23
  • 20
  • 1
    Related: [Are there two known strings which have the same MD5 hash value?](https://crypto.stackexchange.com/q/1434/2592) – kenorb Mar 01 '18 at 16:47
  • 2
    The literal answer is that the _second_ file could have the same MD5 as the first. However the odds are extremely small. – Rick James Mar 15 '18 at 12:23

8 Answers8

317

Probability of just two hashes accidentally colliding is 1/2128 which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

However if you keep all the hashes then the probability is a bit higher thanks to birthday paradox. To have a 50% chance of any hash colliding with any other hash you need 264 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.

Kornel
  • 97,764
  • 37
  • 219
  • 309
  • 4
    Not strictly true. The probability of a collision is much higher than this as a new URL could potentially collide with any existing item in the table. See [This posting](http://stackoverflow.com/questions/2145510/python-random-is-barely-random-at-all) (disclaimer, I wrote it) for a run-down on the maths, and a small python script that can be adapted to compute the probability for a particular number of URLs. – ConcernedOfTunbridgeWells Apr 15 '11 at 14:43
  • @ConcernedOfTunbridgeWells: I did take correction for birthday paradox, which is why answer is in billions, not quintillions. I was unable to verify probability with your script `PV=2**128; SS=2**64`: `OverflowError: long int too large to convert to int` – Kornel Apr 15 '11 at 15:44
  • 20
    *"probability of collision is 1/2^64"* - what? The probability of collision is dependent on the number of items already hashed, it's not a fixed number. In fact, it's equal to exactly `1 - sPn/s^n`, where `s` is the size of the search space (`2^128` in this case), and `n` is the number of items hashed. What you are probably thinking of is `2^64`, which is the approximate number of items you'd need to MD5 hash to have a 50% chance of collision. – BlueRaja - Danny Pflughoeft May 20 '13 at 18:34
  • @BlueRaja-DannyPflughoeft that's what I had in mind indeed. Thanks for the correction. – Kornel May 20 '13 at 19:31
  • 7
    Unfortunately, you are still not correct. You are assuming that the hash function is truly random. It is not. This means that the collision probability is higher. – Jørgen Fogh Feb 16 '15 at 13:09
  • 22
    JørgenFogh: And all laws of physics are "not correct" either. Such level of pedantism is unnecessary because it doesn't change the answer in any meaningful way. – Kornel Feb 16 '15 at 23:56
  • 1
    (This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.); incorrect. this means that by the _time_ you've been hashing 6 billion files per second for 100 years, 50% of the hashes you are generating would collide with previously-generated hashes. – Ry Biesemeyer Nov 10 '17 at 19:14
  • 2
    @yaauie No, that's ridiculously impossible. I'm talking about generating 2^64 hashes out of 2^128 possible ones. That's *one quadrillionth of a percent* of all possible hashes generated. – Kornel Nov 11 '17 at 15:41
  • Intuitively if we ignore the birthday paradox and just look at an approximate solution: Add `2^64` hashes into a list. Now add one more hash to that list. That one more hash has `1 / 2^128` times `2^64` chance of a collision, i.e. that one more hash has a `1 / 2^64` chance of a collision. Now add another `2^64` hashes to the list and you should get a collision. Do the same calculation for `2^63` (and note `2^63 + 2^63 = 2^64`). – robocat Mar 15 '18 at 01:54
  • 23
    So you’re saying there’s a chance! – vargonian May 18 '18 at 08:49
  • Can I use this hash algorithm for filenames? Like hash the contents of files, set the name of those files to their respective hashes and store them in a directory? Maximum number of files in the directory at the same time is around 3000. – Amirhosein Al Jul 29 '20 at 07:05
  • @AmirhoseinAl yes, for all practical purposes it will be as unique as the filenames. – Kornel Jul 29 '20 at 11:32
  • do this means "Don't worry" ? As my DB primary key are MD5 hashes ! – Anurag Vohra Sep 12 '20 at 14:43
  • 1
    @AnuragVohra Yes, you don't have to worry. The most probable collision there is an asteroid hitting earth. – Kornel Sep 16 '20 at 02:16
  • If we take `2^64` random hashes out of `2^128`, then according to the approximated formula from [Birthday attack](https://en.wikipedia.org/wiki/Birthday_attack) we have [0.39](https://www.wolframalpha.com/input/?i=1+-+e%5E%28-%282%5E64%29%5E2%2F%282%282%5E128%29%29%29) chance of _at least one value is chosen more than once_, whereas for [2.2 * 10^19](https://www.wolframalpha.com/input/?i=1+-+e%5E%28-%282.2+*+10%5E19%29%5E2%2F%282%282%5E128%29%29%29) hashes to choose we have 50% chance of at least one collision (see the table in the article) – bartolo-otrit Oct 07 '20 at 14:12
28

S3 can have subdirectories. Just put a "/" in the key name, and you can access the files as if they were in separate directories. I use this to store user files in separate folders based on their user ID in S3.

For example: "mybucket/users/1234/somefile.jpg". It's not exactly the same as a directory in a file system, but the S3 API has some features that let it work almost the same. I can ask it to list all files that begin with "users/1234/" and it will show me all the files in that "directory".

Eriawan Kusumawardhono
  • 4,796
  • 4
  • 46
  • 49
davr
  • 18,877
  • 17
  • 76
  • 99
19

So wait, is it:

md5(filename) + timestamp

or:

md5(filename + timestamp)

If the former, you are most of the way to a GUID, and I wouldn't worry about it. If the latter, then see Karg's post about how you will run into collisions eventually.

kenorb
  • 155,785
  • 88
  • 678
  • 743
Ryan
  • 9,918
  • 7
  • 42
  • 57
  • 1
    Please elaborate on how including the timestamp increases the chance of collision – Bradley Thomas Sep 23 '14 at 03:34
  • 15
    @BradThomas : It does not. The MD5 risk of collision is the same whether it is on the filename or the combination of filename+timestamp. But in the first scenario, you would need to have both a MD5 collision and a timestamp collision. – Vincent Hubert Nov 13 '14 at 15:10
  • 3
    This still leaves a 2^(128^60) chance of a collission with two users per minute. Literally unusable. – Berry M. Nov 29 '16 at 12:47
  • 3
    @BradThomas To be clearer: `md5(filename) + timestamp` reduces the collision risk massively because you would need to have an md5 collision for exactly the same timestamp to have a collision overall. `md5(filename + timestamp)` is the same as `md5(filename)`, assuming that filename is random to start with (because adding more randomness to something random only changes the individual md5 result and the birthday problem still exists across all the md5 hashes). – robocat Mar 15 '18 at 01:38
10

A rough rule of thumb for collisions is the square-root of the range of values. Your MD5 sig is presumably 128 bits long, so you're going to be likely to see collisions above and beyond 2^64 images.

Will Dean
  • 39,055
  • 11
  • 90
  • 118
7

Although random MD5 collisions are exceedingly rare, if your users can provide files (that will be stored verbatim) then they can engineer collisions to occur. That is, they can deliberately create two files with the same MD5sum but different data. Make sure your application can handle this case in a sensible way, or perhaps use a stronger hash like SHA-256.

bdonlan
  • 224,562
  • 31
  • 268
  • 324
  • using a salt would take care of the user engineering problem, no? – StackOverflowed Oct 13 '14 at 16:45
  • It depends on how the salt is applied. It would need to be a prefix of the user-supplied data, or better yet the key for an HMAC. It's still probably a good idea to practice defense in depth though. – bdonlan Dec 08 '14 at 06:56
  • Note although SHA256 is 256 bits long, you can trade off the risk of collisions with the length of the key you are storing by truncating the SHA256 to fewer bits e.g. use SHA256 but truncate it to 128 bits (which is more secure than using MD5 even though they have the same number of bits). – robocat Mar 15 '18 at 01:57
5

While there have been well publicized problems with MD5 due to collisions, UNINTENTIONAL collisions among random data are exceedingly rare. On the other hand, if you are hashing on the file name, that's not random data, and I would expect collisions quickly.

acrosman
  • 12,814
  • 10
  • 39
  • 55
  • The only problem I have with taylors example is that if someone gets a copy of your database they could probably figure out the credit card numbers using a rainbow table ... – Sam Saffron May 04 '09 at 02:42
  • 1
    While I wouldn't choose to use MD5 for credit cards, a Rainbow table of all valid credit card numbers between 10,000,000 (8 digits being the smallest length credit card I've seen) and 9,999,999,999,999,999 (largest 16 digit number) is still a big table to generate. There are probably easier ways to steal those numbers. – acrosman May 04 '09 at 14:01
2

Doesn't really matter how likely it is; it is possible. It could happen on the first two things you hash (very unlikely, but possible), so you'll need to support collisions from the beginning.

Karg
  • 1,467
  • 3
  • 15
  • 18
  • 37
    There may of course be many other bad things which can happen with a probability of 1/2^128. You might not want to single-out this one to worry about. – Will Dean Oct 14 '08 at 15:47
  • 2
    The worst thing that can happen here is you can get a photo. For a relatively small number I would not worry. Now if your software is controlling an autopilot landing an aircraft, thats another story. – Jim C Oct 14 '08 at 15:54
  • 9
    You can't be serious. You'll need to hash 6 billion files per second, every second for 100 years to get good chance of collision. Even if you're very very unlucky, it would probably take more than entire capacity of S3 used for longer than a human lifetime. – Kornel Nov 13 '08 at 22:36
  • 13
    It's billions of times more likely that your database and its backups will all fail. Collisions are not worth worrying about. – Artelius Sep 07 '09 at 01:31
  • 6
    Use the collision prevention time building a bunker to put your server! Those pesky meteors can hit you (very unlikely, but possible), so you'll need to support meteor shelter from the begging. – polvoazul May 22 '18 at 21:12
  • 2
    It would take 100 years to get a **50%** chance of collision at 6G files / sec. You have a **good** chance of collision decades earlier. – user327961 May 24 '18 at 18:46
  • 1
    Bad thing is that it someone could upload colliding files ON PURPOSE, which may lead to bugs or even worse - security breach, for example it could allow to override the file with other file. https://www.avira.com/en/blog/md5-the-broken-algorithm – rumata28 Sep 07 '20 at 08:53
1

MD5 collision is extremely unlikely. If you have 9 trillion MD5s, there is only one chance in 9 trillion that there will be a collision.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • 1
    Many of the other Answers talk about the probability of a collision when adding _one_ more item. I think my Answer are more useful because it talks about the probably of the entire table having a dup. – Rick James Sep 05 '17 at 12:23
  • 1
    This has nothing to do with MD5 and is not correct. It's like saying that if you have 9 trillion cats there is a 1 in 9 trillion chance that someone else has a identical cat. The key problem here is that you can get same hash with more than one value. – Joonas Alhonen Jan 24 '19 at 13:07
  • @JoonasAlhonen - Yes, that is true. And a lot of poor people use that as an excuse to buy yet another Lottery ticket they cannot afford. – Rick James Jan 25 '19 at 06:38