Linus talk - Git vs. data corruption?

Question

I watched Linus (creator of git) give a talk on git. At one point he talks about how git is safer. He also said that other SCMs can't deal with data corruption. So I googled it and found out that this is not true.

For example, this link talks about "Replace the offending commit with a new commit altogether, re-creating approximately the same changes."

Maybe I misunderstood him, any idea what he meant?

He said, many times, that git is the ONLY SCM that lets you check out the same data you put in.

@unutbu yes, thx for finding that :) i hoped someone would watch it and find it ;P sorry for using you, but watching the whole thing again just to find it was ... — IAdapter, Nov 12 '11 at 13:34

sehe · Answer 1 · 2011-11-13T00:35:42.027

Linus was referring to the fact that git commits are identifiable by their hash.

Git trees are objects whose entries point to other trees and to blobs (read: blob = file, tree = directory, roughly).

The cryptographic hash of a parent node is a hash over the hashes of all underlying trees/blobs, recursively. Such trees are known as Merkle (hash) trees and have the interesting property that the top-level hash is a cryptographically strong hash that uniquely identifies the whole tree.

Note that the commit hash covers the commit attributes, and these include the parent ids. That is, if some file in some revision ever changes, the hash of the blob changes, therefore the hash(es) of the containing trees change, the hash of the snapshot (root tree) changes, the hash of the commit changes, and with it the hashes of all child commits must change, and so on. All subsequent history will be altered.
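To make the chain reaction concrete, here is a minimal Python sketch of how git derives object ids. The `"<type> <size>\0"` header and the SHA-1 digest match git's classic object format; the helper names, the single-file tree, and the fixed author line are assumptions made up for this illustration, and tree sorting is simplified to plain name order (fine for flat trees of regular files).

```python
import hashlib
from typing import Dict, List, Tuple

def git_object_id(kind: str, body: bytes) -> str:
    # git hashes "<kind> <size>" + NUL + body with SHA-1
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def blob_id(content: bytes) -> str:
    return git_object_id("blob", content)

def tree_id(entries: Dict[str, str]) -> str:
    # entries maps file name -> blob id (hex); regular files only (mode 100644)
    body = b"".join(
        b"100644 " + name.encode() + b"\x00" + bytes.fromhex(oid)
        for name, oid in sorted(entries.items())
    )
    return git_object_id("tree", body)

def commit_id(tree: str, parents: List[str], message: str) -> str:
    # Hypothetical fixed author/committer so the ids are reproducible
    who = "A U Thor <author@example.com> 0 +0000"
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message]
    return git_object_id("commit", ("\n".join(lines) + "\n").encode())

def build_history(first_version: bytes) -> Tuple[str, str]:
    # Two commits: the child always has the same content, only the
    # content of the root commit varies between calls.
    root = commit_id(tree_id({"README": blob_id(first_version)}), [], "first")
    child = commit_id(tree_id({"README": blob_id(b"v2\n")}), [root], "second")
    return root, child

original = build_history(b"v1\n")
tampered = build_history(b"v1 (tampered)\n")
print("original:", original)
print("tampered:", tampered)
```

The child commit has identical content in both histories, yet its id differs, because the parent id is part of what gets hashed.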

If any of these rules are violated, it will be trivially detectable:

the hash of a single tree is deterministically verifiable in O(n), where n is the number of objects in the root tree
the integrity of a full branch history is deterministically verifiable in O(n), where n is the number of nodes in the revision chain.
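The linear-time history check can be illustrated with a toy model. This is not git's actual object format or its fsck implementation: the store is a plain dict, each "commit" is just a parent id line followed by snapshot bytes, and all names are hypothetical. It only shows why one pass over the chain is enough to detect tampering.

```python
import hashlib
from typing import Dict, Optional

def object_id(body: bytes) -> str:
    return hashlib.sha1(body).hexdigest()

def add_commit(store: Dict[str, bytes], snapshot: bytes,
               parent: Optional[str]) -> str:
    # Toy commit: parent id (or empty) on the first line, then the snapshot
    body = (parent or "").encode() + b"\n" + snapshot
    oid = object_id(body)
    store[oid] = body
    return oid

def verify_chain(store: Dict[str, bytes], tip: str) -> bool:
    # Walk from the tip to the root, recomputing every hash once: O(n)
    oid = tip
    while oid:
        body = store.get(oid)
        if body is None or object_id(body) != oid:
            return False  # missing or corrupted object
        parent, _, _ = body.partition(b"\n")
        oid = parent.decode()
    return True

store: Dict[str, bytes] = {}
c1 = add_commit(store, b"v1", None)
c2 = add_commit(store, b"v2", c1)
c3 = add_commit(store, b"v3", c2)
print(verify_chain(store, c3))   # True

# Silently alter the oldest snapshot while keeping its old id
store[c1] = store[c1].replace(b"v1", b"vX")
print(verify_chain(store, c3))   # False
```

An attacker who also wanted the check to pass would have to re-key the altered object, which changes the parent line of every descendant and therefore every later id.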

In fact, git verify-tag and git fsck are useful commands for doing the checking explicitly. Besides that, verification occurs automatically in git subcommands (send-pack, receive-pack, read-tree, write-tree etc.)

Re: Replace the offending commit thread

In this first post, Linus already defuses the bomb:

Hmm. Scary. That should not have been successful with a corrupt repo.

Unless you have done a .grafts file to hide the corruption, or something like that?

Which is immediately confirmed by Denis Bueno in the response.

added a response to the linked thread that talked about replacing a commit — sehe, Nov 12 '11 at 16:30
Do you mean that a hash is a signature? I think that's a bit misleading. — Tamás Szelei, Nov 13 '11 at 00:24
@TamásSzelei: I was concerned about that confusion when I wrote it. No, I meant to refer to [Merkle signature scheme](http://en.wikipedia.org/wiki/Merkle_signature_scheme) but I'll edit the reference away since it doesn't add much to the clarity. Thx for the heads up — sehe, Nov 13 '11 at 00:34

score 3 · Answer 2 · edited May 23 '17 at 12:11

I think he was referring to the fact that git uses a cryptographic hash to ensure data correctness, and that it stores snapshots rather than changesets. Saying that git is the only SCM that does so is probably an overstatement today, but it might have been true in the past, before the advent of DVCSs. Note that the term "snapshot" does not mean it stores the entire files. See this answer for details.
