-3

Using Git, I don't understand how SHA can generate a code of just 40 hexadecimal digits that can then be mapped to any file, which could be hundreds of lines long.

The way I'm thinking of it: let's say the string '1' -> 00...01, the string '2' -> 00...02, the string 'a34ed..fc' -> a34ed..fc, etc., so the hash function is just returning its input. Then it's clear that all the hash codes get used up very quickly, and any string 41 characters long will be reusing one of the codes.

Also, I know SHA doesn't guarantee that hashes will always be unique, but I don't see how it even comes close to being useful.

user2802557
  • If it wasn't useful, git wouldn't work. But it obviously does. – mvp Jan 15 '16 at 00:50
  • Yes, if you have 2^160 files in your repo, then SHA hashes will no longer work. – Oliver Charlesworth Jan 15 '16 at 00:51
  • Actually, due to the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem), problems will start at `2^80` files. Which is still more than any physical constraint of our world. – mvp Jan 15 '16 at 00:58
  • I thought it is the contents of the files being hashed, so the number of files doesn't matter? Using my example, a single file containing one 41-character string can't be uniquely hashed, right? – user2802557 Jan 15 '16 at 01:02
  • Related: https://www.schneier.com/blog/archives/2015/10/sha-1_freestart.html the first documented SHA-1 near-collision took 10 days to occur on *a 64-GPU cluster specifically devoted to the task.* A collision will not happen in your git repository. You have much better odds of winning the PowerBall lottery repeatedly. – elixenide Jan 15 '16 at 01:19
  • Of course I know I'm wrong about something, because git uses this, but I don't see why. From the example I tried to give, is it clear what my mistake is? – user2802557 Jan 15 '16 at 01:44
  • It doesn't map `a` to `a`, `b` to `b`, etc. It's likely that there are collisions at the 41-character length or with even shorter strings, but the point is that the algorithm makes them extremely improbable (not to mention functionally impossible to list). As @KeithThompson explained in his answer, there are `2^160` possible hashes. The odds of getting two identical hashes are infinitesimal, whether you're using 40-character plain texts or 20 MB plain texts. It just isn't going to happen. – elixenide Jan 15 '16 at 03:53
  • Yes, but in general it could map a to a, couldn't it, though it wouldn't be a very good algorithm. Using letters of the alphabet, just a 160-character string has 26^160 possibilities, which is clearly >> 2^160, so why is 2^160 being talked about as if it is so high? Is it not the case that you need to get the unique text back from just the hash code? – user2802557 Jan 15 '16 at 07:58
  • @user2802557: Git doesn't get the file content from the hash. It stores both. – Keith Thompson Jan 15 '16 at 15:53

3 Answers

5

A SHA-1 hash is 160 bits long. That gives you 2^160, or exactly

1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976

possible hashes.

Assuming hash values are more or less unpredictable, the odds of two files accidentally having the same hash are infinitesimal to the point that it's just not worth worrying about it.
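To see the fixed-width property concretely, here's a minimal sketch using Python's standard hashlib; the inputs are arbitrary examples:

```python
import hashlib

# SHA-1 always yields a 160-bit digest (40 hex digits),
# no matter how long the input is -- it is not the identity
# mapping imagined in the question.
for data in (b"1", b"2", b"a" * 41, b"x" * 1_000_000):
    print(f"{len(data):>9} bytes -> {hashlib.sha1(data).hexdigest()}")
```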

Quoting from Scott Chacon's book "Pro Git":

However, you should be aware of how ridiculously unlikely this scenario is. The SHA-1 digest is 20 bytes or 160 bits. The number of randomly hashed objects needed to ensure a 50% probability of a single collision is about 2^80.

...

Here's an example to give you an idea of what it would take to get a SHA-1 collision. If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (1 million Git objects) and pushing it into one enormous Git repository, it would take 5 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision. A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.

It's true that there must be two 21-byte files that have the same SHA-1 hash (since there are 2^168 such files and only 2^160 possible SHA-1 hashes). No such files have ever been discovered.

UPDATE: As of February 2017, two distinct PDF files with identical SHA-1 checksums have been generated, using a technique that's more than 100,000 times as fast as a brute-force attack. Details here: https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html

Linus Torvalds (the author of Git) has posted a (preliminary) response here: http://marc.info/?l=git&m=148787047422954

Looking at the comments, it seems that the OP's original misunderstanding was the assumption that the SHA-1 hash could be used to determine the contents of the file. It can't. Git uses the SHA-1 hash to construct the name of the file or other object. The file itself is stored somewhere under the .git/objects directory. For example, a file with a hash of

ff5a5eff8c90da934937165c9d0e9f96f9ecaf75

might be stored in

.git/objects/ff/5a5eff8c90da934937165c9d0e9f96f9ecaf75

-- and that file can be arbitrarily large. (It's not that simple, of course; git plays a lot of tricks to combine similar files and otherwise compress data.) Thanks to Patrick Schlüter for his comment.
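As a rough sketch of that naming scheme (this assumes the well-known blob header format; real Git also zlib-compresses the stored object, as Patrick Schlüter's comment below notes):

```python
import hashlib

def git_blob_path(content: bytes) -> str:
    """Mimic how Git names a blob: SHA-1 over a small header
    plus the file content, then split into directory/file."""
    header = b"blob %d\x00" % len(content)
    digest = hashlib.sha1(header + content).hexdigest()
    return f".git/objects/{digest[:2]}/{digest[2:]}"

# Should agree with `git hash-object` for the same content:
print(git_blob_path(b"hello world\n"))
# .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```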

Keith Thompson
  • But what if my repo contains all the files? - NSA – Mike Covington Jan 15 '16 at 01:06
  • @MikeCovington: It doesn't. – Keith Thompson Jan 15 '16 at 01:08
  • Yes, I know that's a lot of possibilities, but I still don't see how the example I gave doesn't use up all the possibilities in only 40 characters. What is it I'm missing here? – user2802557 Jan 15 '16 at 01:16
  • It's 20 bytes, not 40. But you don't have 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976 different files. – Keith Thompson Jan 15 '16 at 01:18
  • Why is everyone talking about the number of files? It seems irrelevant to me. I thought what SHA does is basically take a string and convert it to a 40-digit hexadecimal number, simple as that, or am I wrong? But to me that is clearly not possible. – user2802557 Jan 15 '16 at 01:26
  • Yes, it does take an arbitrary byte stream and convert it into a 160-bit hash. But it tries really hard to make sure that hash values are spread evenly across all possible 2^160 values. To date, nobody has been able to produce 2 different byte streams that have an identical SHA-1 hash. – mvp Jan 15 '16 at 01:52
  • @user2802557: It's a one-way conversion; you can't get the original file back from the 20-byte checksum. The hashes for two arbitrary files are not 100% certain to be distinct -- but they are *almost* 100% certain to be unique. If you have a million files in a repo, it's almost certain that they'll all have distinct SHA-1 hashes. And that's good enough for it to work. – Keith Thompson Jan 15 '16 at 01:54
  • Incidentally, I think this question has more down-votes than it should, although it would make more sense for it to be phrased differently and posted on crypto.stackexchange.com. :-) The usefulness of SHA-1 for git is that it meets five more constraints than the single one required of any hash function (determinism): 1: arbitrary inputs; 2: fast; 3: fixed output range; 4: reasonably uniform; and 5: non-invertible. The last one is the "one-way conversion" you mention. – torek Jan 15 '16 at 05:57
  • @torek It's downvoted because it lacks research effort and also because there are plenty of duplicates. – JBentley Jan 15 '16 at 06:50
  • @KeithThompson You can't get the contents of the file back from just the 20-byte code? Then how does git store things by those codes while still letting you get the contents back later? – user2802557 Jan 15 '16 at 08:01
  • It doesn't. The objects themselves are compressed with zlib; the SHA-1 of the content is computed and used as the file name, with the first 2 hex digits as the directory and the following 38 digits as the file name. If you look in .git/objects you can clearly see that the files are not 40 bytes in size. – Patrick Schlüter Jan 15 '16 at 10:10
  • @user2802557: It stores both the SHA-1 hashes and the contents of the files, with data structures that let it retrieve whatever information it needs. – Keith Thompson Jan 15 '16 at 15:52
  • @PatrickSchlüter Thanks, you're the only person who's actually answered where my mistake was – user2802557 Jan 17 '16 at 12:14
  • @PatrickSchlüter, from the [git book](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) it seems that it compresses the SHA itself (which I don't understand why). – LRDPRDX Mar 03 '21 at 10:19
1

In fact, what I call your "margin of safety" determines how many objects you can store.

The widely quoted "about 2^80" number is the point at which you have approximately a 50% chance of a hash collision. To keep the chance below about 1 in 10^18, the number of distinct objects in the repository should not exceed about 1.7 quadrillion (1.71×10^15).

(I did some math for a book I'm working on; I haven't had it checked by a real mathematician, but when I ran the same sort of numbers against other hash sizes, my outputs agreed with those on Wikipedia, for whatever that's worth. :-) )

Edit to add: here's the approximation formula. Let r be the cardinality of the hash function (so r is 2^160 for SHA-1) and U be the desired probability of uniqueness (so U is 0.5 for the usual "50% chance of safety, 50% chance of collision" statistic). The maximum number of hash inputs is:

(1 + sqrt(1 + 8r ln(1/U))) / 2

The natural log of 1/0.5 is about 0.693, so we have about sqrt(4r)/2, which is of course just about sqrt(r). Hence for a k-bit hash, "50% probability of uniqueness" occurs after about 2^(k/2) hashes.

To see (ballpark) how I get my number, in the neighborhood of 10^15 objects, let U = 1 - 10^-18. The natural log of 1/U is basically the original 10^-18, which means we knock most of 2^60 off the range r, leaving about 2^100. The square root of that is about 2^50, which is about 10^15.
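As a quick numeric check of the approximation, here's a Python sketch (`max_inputs` is just a name invented here):

```python
from math import log1p, sqrt

def max_inputs(bits: int, p_collision: float) -> float:
    """Birthday-bound approximation: how many random inputs a
    `bits`-bit hash can take before the collision probability
    reaches p_collision (so U = 1 - p_collision)."""
    r = 2.0 ** bits
    # ln(1/U) = -ln(1 - p); log1p keeps precision for tiny p
    return (1 + sqrt(1 + 8 * r * -log1p(-p_collision))) / 2

print(f"{max_inputs(160, 0.5):.3g}")    # ~1.4e24, roughly 2^80
print(f"{max_inputs(160, 1e-18):.3g}")  # ~1.71e15 objects
```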

torek
  • You're right about accidental SHA-1 collisions, but deliberate SHA-1 collisions are rapidly becoming more feasible. https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html – Keith Thompson Feb 24 '17 at 00:55
  • @KeithThompson: yes, I have mentioned this in some other answers. Interestingly, Mercurial has room for SHA-256 with no format changes. Git does not: `tree` objects in particular store binary SHA-1 values in 20 byte fields. Worse, numerous extra-Git scripts assume hashes are exactly 40 characters long (especially, that the null hash is 40 `0`s). – torek Feb 24 '17 at 01:03
0

The mistake being made is thinking that the SHA code is used to generate the contents of files. The contents are stored by Git separately; the SHA code is just used as a key to them. The reason commits can't just have keys numbered from 1 upward is that, with Git, different people can work on different branches of the same project, making commits without knowing about each other. When these get merged together, we still need the commits to have unique keys. The best way to make the keys unique is to use something like SHA, which creates an effectively unique code from the content itself; as others have explained, the probability of getting the same key twice is almost zero.
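As a toy illustration of that last point (the byte strings below are made-up stand-ins for real commit objects, and Git's actual object header is omitted): two commits created independently hash to different keys because each key is derived from the commit's own content.

```python
import hashlib

# Two developers on different branches create commits independently;
# each key comes from the commit's own bytes, so no central counter
# is needed and collisions are astronomically unlikely.
commit_a = b"tree 3b18e5...\nauthor alice\n\nfix parser"
commit_b = b"tree 3b18e5...\nauthor bob\n\nadd tests"
for commit in (commit_a, commit_b):
    print(hashlib.sha1(commit).hexdigest())
```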

user2802557