
Recently a team of researchers generated two files with the same SHA-1 hash (https://shattered.it/).

Since Git uses this hash for its internal storage, how far does this kind of attack influence Git?

Peter Mortensen
Rudi
  • Possible duplicate of [Hash collision in git](http://stackoverflow.com/questions/10434326/hash-collision-in-git) – Tim Biegeleisen Feb 24 '17 at 07:38
  • Just for completeness: Linus answered some questions regarding this topic [here](https://public-inbox.org/git/Pine.LNX.4.58.0504291221250.18901@ppc970.osdl.org/) and [here](http://marc.info/?l=git&m=148787047422954) – ckruczek Feb 24 '17 at 07:45
  • A few wonderful answers can be found here: [How would Git handle a SHA-1 collision on a blob?](http://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob) – dahlbyk Feb 24 '17 at 08:14
  • @TimBiegeleisen (and upvoter): I'd argue that this is not a duplicate as it's specifically about *the* (single) deliberate SHA-1 collision found recently, rather than a theoretical discussion of the general idea. Of course a good theoretical discussion should subsume the question, but that requires that it be answered *ex post facto*, and existing questions obviously could not, in the past. :-) – torek Feb 24 '17 at 22:00
  • Isn't this question off-topic here on Stack Overflow? The topic itself seems interesting, but it isn't suitable for Stack Overflow, is it? The review system said that this is not off-topic, but I think that's not correct. – bkausbk Mar 21 '17 at 08:32

2 Answers


Edit, late December 2017: Git version 2.16 is gradually acquiring internal interfaces to allow for different hashes. There is a long way to go yet.


The short (but unsatisfying) answer is that the example files are not a problem for Git—but two other (carefully calculated) files could be.

I downloaded both of these files, shattered-1.pdf and shattered-2.pdf, and put them into a new empty repository:

macbook$ shasum shattered-*
38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-1.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-2.pdf
macbook$ cmp shattered-*
shattered-1.pdf shattered-2.pdf differ: char 193, line 8
macbook$ git init
Initialized empty Git repository in .../tmp/.git/
macbook$ git add shattered-1.pdf 
macbook$ git add shattered-2.pdf 
macbook$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

    new file:   shattered-1.pdf
    new file:   shattered-2.pdf

Even though the two files have the same SHA-1 checksum (and display mostly the same, although one has a red background and the other has a blue background), they get different Git hashes:

macbook$ git ls-files --stage
100644 ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0 0   shattered-1.pdf
100644 b621eeccd5c7edac9b7dcba35a8d5afd075e24f2 0   shattered-2.pdf

Those are the two SHA-1 checksums for the files as stored in Git: one is ba9aa... and the other is b621e.... Neither is 38762c.... But—why?

The answer is that Git does not store (or hash) a file's raw contents alone. It stores the literal string blob, a space, the size of the file in decimal, an ASCII NUL byte, and then the file data, and the checksum is computed over all of that. Both files are exactly the same size:

macbook$ ls -l shattered-?.pdf
...  422435 Feb 24 00:55 shattered-1.pdf
...  422435 Feb 24 00:55 shattered-2.pdf

so both are prefixed with the literal text blob 422435\0 (where \0 represents a single NUL byte, as in C or Python string escapes).
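As a rough illustration of the header described above, here is a minimal Python sketch. The function names `git_blob_id` and `plain_sha1` are made up for this example and are not part of Git; the sketch only assumes the `blob <size>\0` prefix already described, and its result should agree with what `git hash-object` reports for the same bytes.

```python
import hashlib


def plain_sha1(data: bytes) -> str:
    """SHA-1 of the raw bytes, i.e. what `shasum` reports for a file."""
    return hashlib.sha1(data).hexdigest()


def git_blob_id(data: bytes) -> str:
    """SHA-1 of the bytes as Git hashes a blob: header first, then data.

    The header is the word "blob", a space, the size in decimal,
    and a single NUL byte.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python blobid.py shattered-1.pdf shattered-2.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            content = f.read()
        print(plain_sha1(content), git_blob_id(content), path)
```

Run against the two PDFs, the first column should be identical for both files while the second column should match the two different IDs shown by git ls-files --stage above.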

Perhaps surprisingly (or not, if you know anything of how SHA-1 is calculated), adding the same prefix to two different files that previously produced the same checksum causes them to produce different checksums.

The reason this should become unsurprising is that if the final checksum result were not exquisitely sensitive to the position, as well as the value, of each input bit, it would be easy to produce collisions on demand by taking a known input file and merely re-arranging some of its bits. These two input files produce the same sum despite having a different byte at char 193, line 8, but this result was achieved, according to the researchers, by trying over 9 quintillion (short scale) inputs. To get that result, they put in carefully chosen blocks of raw data, at a position they controlled, that would affect the sums, until they found pairs of inputs that resulted in a collision.

By adding the blob header, Git moved the position, destroying the 110-GPU-years of computation in a single more or less accidental burp.

Now, knowing that Git will do this, they could repeat their 110-GPU-years of computation with inputs that begin with blob 422435\0 (provided their sacrificial blocks don't get pushed around too much; and the actual number of GPU-years of computation needed would probably vary, as the process is a bit stochastic). They would then come up with two different files that could have the blob header stripped off. These two files would now have different SHA-1 checksums from each other, but when git add-ed, both would produce the same SHA-1 checksum.

In that particular case, the first file added would "win" the slot. (Let's assume it's named shattered-3.pdf.) A good-enough Git—I'm not at all sure that the current Git is this good; see Ruben's experiment-based answer to How would Git handle a SHA-1 collision on a blob?—would notice that git add shattered-4.pdf, attempting to add the second file, collided with the first-but-different shattered-3.pdf and would warn you and fail the git add step. In any case you would be unable to add both of these files to a single repository.
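The "first file wins, second file is rejected" behaviour can be sketched with a toy content-addressed store. This is not how Git is implemented: `ObjectStore`, `CollisionError`, `sha1_hex`, and `weak_hash` are hypothetical names for this illustration, and the deliberately weak hash exists only so that a collision can be produced without 110 GPU-years of work.

```python
import hashlib
from typing import Callable, Dict


class CollisionError(Exception):
    """Two different payloads mapped to the same object ID."""


def sha1_hex(payload: bytes) -> str:
    return hashlib.sha1(payload).hexdigest()


def weak_hash(payload: bytes) -> str:
    # Keep only one hex digit (16 possible IDs) so collisions are easy to hit.
    return hashlib.sha1(payload).hexdigest()[:1]


class ObjectStore:
    """Toy store keyed by the hash of 'blob <size>\\0' + data."""

    def __init__(self, hash_fn: Callable[[bytes], str] = sha1_hex) -> None:
        self._hash_fn = hash_fn
        self._objects: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        payload = b"blob %d\0" % len(data) + data
        oid = self._hash_fn(payload)
        existing = self._objects.get(oid)
        if existing is None:
            # First object with this ID wins the slot.
            self._objects[oid] = payload
        elif existing != payload:
            # Same ID, different bytes: refuse rather than silently
            # keeping the old content under the new file's name.
            raise CollisionError(f"object {oid} already exists with different content")
        return oid
```

With `weak_hash`, adding 17 distinct files is guaranteed to raise `CollisionError`, since only 16 IDs exist. The design choice being illustrated is the byte-for-byte comparison on an ID match; a store that skips it would quietly keep the first file's content.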

But first, someone has to spend a lot more time and money to compute the new hash collision.

torek
  • but just to point out, while adding a _new_ file is not a security concern, replacing an existing blob in an important, compromised repository _is_. this would allow the insertion of a backdoor at an arbitrary point in history, even if every commit/tag is signed, without compromising the referential or cryptographic integrity of the repository. – strugee Mar 02 '17 at 20:39
  • that being said, see also Linus' response below. – strugee Mar 02 '17 at 20:40
  • Sure: if there is a bad blob in an existing repository, you want to replace it with a good one. If the good one has the same hash, you have a problem ... *unless* there are a near infinite number of new "good ones" with *other* hashes, which is likely the case. – torek Mar 02 '17 at 22:08
  • what? I can't parse your sentence. I'm talking about a malicious actor replacing a legitimate blob with a malicious one and nobody noticing; I'm not sure what you're trying to say about "infinite number of new 'good ones'"? – strugee Mar 03 '17 at 06:35
  • What I'm saying here is that if you have an existing repository with a good blob, you literally *can't* replace it with a bad one, at least not using Git commands (including fetch and push). I think maybe what you are thinking of is: "I, Evil Bob, clone Good Bob's repository G. Then I manufacture bad blob X with same hash as good blob Y, and construct an all-new repository B that uses the same hashes as G. Then I somehow convince you, the victim, to clone my B instead of Good Bob's G." That's pretty tricky, in multiple senses: why would I take yours instead of Good Bob's? – torek Mar 03 '17 at 08:52
  • Meanwhile, what I *thought* you were saying was: "I, Evil Bob, planned years in advance by sneaking a back door into massively-cloned repository E (for Evil). Now Good Bob can't fix it because he can't replace my historical bad blob!" Which is true: he can't replace the file with one with the same hash, but he doesn't *have* to. There is no need to assume evil as there are plenty of zero-day exploits in so much software. :-) – torek Mar 03 '17 at 08:56
  • ah, yeah. I was assuming the canonical repository had been compromised in some other way like e.g. through a GitHub vulnerability. – strugee Mar 03 '17 at 18:57

Linus' response might shed some light:

> IIRC someone has been working on parameterizing git's SHA1 assumptions so a repository could eventually use a more secure hash. How far has that gotten? There are still many "40" constants in git.git HEAD.

I don't think you'd necessarily want to change the size of the hash. You can use a different hash and just use the same 160 bits from it.

> Since we now have collisions in valid PDF files, collisions in valid git commit and tree objects are probably able to be constructed.

I haven't seen the attack yet, but git doesn't actually just hash the data, it does prepend a type/length field to it. That usually tends to make collision attacks much harder, because you either have to make the resulting size the same too, or you have to be able to also edit the size field in the header.

pdf's don't have that issue, they have a fixed header and you can fairly arbitrarily add silent data to the middle that just doesn't get shown.

So pdf's make for a much better attack vector, exactly because they are a fairly opaque data format. Git has opaque data in some places (we hide things in commit objects intentionally, for example), but by definition that opaque data is fairly secondary.

Put another way: I doubt the sky is falling for git as a source control management tool. Do we want to migrate to another hash? Yes. Is it "game over" for SHA1 like people want to say? Probably not.

I haven't seen the attack details, but I bet

(a) the fact that we have a separate size encoding makes it much harder to do on git objects in the first place

(b) we can probably easily add some extra sanity checks to the opaque data we do have, to make it much harder to do the hiding of random data that these attacks pretty much always depend on.

Linus

Source: https://marc.info/?l=git&m=148787047422954
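The suggestion in the quoted message, to use a different hash but keep only 160 bits of it, can be sketched as follows. This only illustrates that one sentence and is not a description of what Git implements; the function name `truncated_sha256_id` is hypothetical, and SHA-256 is chosen here purely as an example of "a different hash".

```python
import hashlib


def truncated_sha256_id(data: bytes, bits: int = 160) -> str:
    """Hash a blob with SHA-256 but keep only the first `bits` bits.

    The same 'blob <size>\\0' header is prepended, so the only thing
    that changes relative to the SHA-1 scheme is the hash function.
    """
    if bits % 8 != 0 or not 0 < bits <= 256:
        raise ValueError("bits must be a multiple of 8 between 8 and 256")
    payload = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
    digest = hashlib.sha256(payload).digest()
    return digest[: bits // 8].hex()
```

With the default of 160 bits the result is 40 hex characters, which is why the existing "40" constants mentioned in the message would not need to change under such a scheme.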

Omar Ali
Mariano Anaya