
AFAIK git just trusts the SHA-1 hash to be unique, so it doesn't handle the case where two hashes clash (I know, it's unlikely).

Out of curiosity, what would be a good way to handle collisions like this? I'm thinking about also checking whether the file sizes match.

Is there anyone capable of doing some math to determine by how much that would decrease the chance of a collision?

Or am I overthinking this and SHA1 is good enough in practice?

graywolf

3 Answers

3

SHA-1 is good enough in practice. Under slightly optimistic assumptions the chance of a collision between two different objects is about one out of 2^160. Due to the "birthday problem" the chance rises remarkably fast as you add more objects, but it takes a great deal of rise to get anywhere near your chances of being struck by lightning: about 1 out of 700,000 in any given year, or 1 out of 3,000 in your lifetime (same link). (Yes, I know those numbers don't seem to work together. I'm just using their numbers.)
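
To put a rough number on that, here is a back-of-the-envelope sketch of my own (using the standard birthday-bound approximation for an ideal 160-bit hash, so treat it as illustrative only): it estimates how many objects you would need before an accidental collision becomes as likely as that one-in-700,000 lightning strike.

```python
from math import sqrt, log

HASH_BITS = 160
P_LIGHTNING_PER_YEAR = 1 / 700_000  # the figure quoted above

def objects_for_collision_probability(p, bits=HASH_BITS):
    # Invert the birthday approximation p ~= 1 - exp(-n^2 / 2^(bits + 1))
    return sqrt(2 * (2 ** bits) * log(1 / (1 - p)))

print(f"{objects_for_collision_probability(P_LIGHTNING_PER_YEAR):.2e}")
# ~2.0e+21 objects before a collision is as likely as this year's lightning strike
```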

Eventually, though, git should probably switch to a higher-number-of-bits variant of SHA. This will break all those scripts that assume that the SHA-1 is always exactly 40 characters. :-)

torek
  • according to your link, 2^(N/2) == 2^80; that's a pretty high number, but still, using SHA-512 instead would help :) I'll probably have lots of hashes, so it could be useful. As a side note, is there any pair of hash functions that guarantees a collision won't happen for the same input? So I can just use a pair of hashes instead of just one hash? Is it even theoretically possible? – graywolf Oct 10 '14 at 22:31
  • How many hashes per second do you plan on creating? Multiply that number by 365*24*60*60, then divide the result into 2^80, and that tells you how long you have until you need to worry. I don't think using SHA-512 would help anything but chew up disk space. – Andrew C Oct 10 '14 at 23:02
  • @Paladin: no, by the pigeonhole principle there's no way to number more items uniquely than you have numbers. Hashes reduce a large space to a smaller one, so there must be *some* values that produce collisions; if you use one N-bit hash and then one M-bit hash, the combined space has at most 2^(N+M) values, and if the input space is larger than 2^(N+M) there will still be collisions. The issue with SHA-1 is not so much *accidental* collisions as *intentional* ones: it's probably possible to compute a deliberate hash collision that will break a git repo, given enough CPU time. – torek Oct 11 '14 at 00:36
3

At some point I wrote up and calculated the following (no idea if my math was correct or not):

Birthday Paradox – for 365 buckets, you have a 50% chance of collision at 23 items, a 99.9% chance at 70, and a 99.9999% chance at 100, but you don't hit 100% until 366 (leap years excluded). A normal person might guess you wouldn't reach 50% until approximately 180.
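
Those figures are easy to reproduce with the exact product formula rather than an approximation; a quick check of my own (not part of the original calculation):

```python
from math import prod

def p_collision(n, buckets=365):
    # Probability that at least two of n draws land in the same bucket
    return 1 - prod((buckets - i) / buckets for i in range(n))

for n in (23, 70, 100, 366):
    print(n, p_collision(n))
# 23 -> ~0.507, 70 -> ~0.999, 100 -> ~0.9999997, 366 -> 1.0
```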

For the SHA-1 address space of 160 bits, you are at a 50% chance of collision when you have added 1.42*10^24 objects. Git does not handle collisions, and instead presumes they will never happen.
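
That 1.42*10^24 figure is just the usual 50%-collision bound n ≈ sqrt(2 * 2^bits * ln 2); a one-line check (again my own, not from the answer):

```python
from math import sqrt, log

bits = 160
n_half = sqrt(2 * (2 ** bits) * log(2))  # objects needed for ~50% collision odds
print(f"{n_half:.3e}")  # ~1.423e+24
```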

Note that Git puts 4 different types of objects into the same address space: commits, blobs, trees, and tags. Git already checks that the content being hashed is identical, so checking file sizes wouldn't add anything.
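
One related detail (my addition, not the answer's): the bytes Git actually feeds to SHA-1 are "<type> <size>\0<content>", so an object's size already influences its hash. A minimal sketch reproducing a blob ID, which you can compare against the real git hash-object command:

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    # Git hashes "<type> <size>\0<content>"; the length is baked into the input
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Should agree with: printf 'hello world\n' | git hash-object --stdin
print(git_blob_sha1(b"hello world\n"))
```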

There are a number of techniques for dealing with hash collisions. They are always employed when your hash output space is small (say, 16-bit or less). Multiple-hash algorithms or buckets are the most popular I've seen in practice.
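
As an illustration of the "buckets" approach mentioned above, here is a toy separate-chaining sketch of mine (nothing Git actually does): colliding keys simply share a bucket and are told apart by comparing the keys themselves.

```python
class ChainedHashMap:
    """Toy hash map where colliding keys share a bucket (a chain)."""

    def __init__(self, n_buckets=16):
        self.buckets = [[] for _ in range(n_buckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # same key: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # different key in the same bucket: chain it

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)

m = ChainedHashMap(n_buckets=1)  # force every key into the same bucket
m.put("a", 1); m.put("b", 2)
print(m.get("a"), m.get("b"))    # 1 2 -- both survive despite "colliding"
```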

Andrew C
1

Is there anyone capable of doing some math to determine by how much that would decrease the chance of a collision?

Or am I overthinking this and SHA1 is good enough in practice?

You're overthinking it. It's not possible to explain why without explaining just how unlikely it is to happen, and the odds are so ... we don't even have hyperbole for how unlikely it is.

Here, try this: at five-card stud, your odds of getting a royal flush are 1 in 649739.

So you're playing five-card stud, and somebody at the table gets a royal flush.

Then the next hand, he gets another royal flush.

Then the next hand, he gets another royal flush.

Then the next hand, he gets another royal flush.

Then he has to beat you at rock-paper-scissors four times in a row.


Here's another one. Say the Linux kernel project had been generating a million files a second, every second, since the big bang. It would still be short of the 2^80 files needed to reach 50-50 odds.
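
A quick back-of-the-envelope check of that claim (taking the age of the universe as roughly 13.8 billion years):

```python
seconds_since_big_bang = 13.8e9 * 365.25 * 24 * 3600    # ~4.4e17 seconds
files = seconds_since_big_bang * 1_000_000              # one million files per second
print(f"{files:.2e} files generated vs {2**80:.2e} needed")  # ~4.4e23 vs ~1.2e24
```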


Or: you go down to the corner store and buy one, just one, mega-millions lottery ticket. Come Saturday, you find out you've won a million dollars!

Heck, if you're going to be that lucky, you should probably buy a lottery ticket. So you go down to the corner store and buy one, just one, mega-millions lottery ticket. Come Wednesday, you find out you've won another million dollars!

Heck, if you're going to be that lucky, you should probably buy a lottery ticket. So you go down to the corner store and buy one, just one, mega-millions lottery ticket. Come Saturday, you find out you've won yet another million dollars!


However hopeful you are, however many dreams you'd hang on the chance that you'll be the lucky one in any of those scenarios, that's about how much you should worry.

jthill