
Trying to determine how quickly a user would be warned of corruption in the object database with git-1.7.4.1, I pulled a one-bit switcheroo:

$ git init repo
Initialized empty Git repository in /tmp/repo/.git/
$ cd repo
$ echo 'very important info' >critical
$ git add critical
$ git commit -m critical
[master (root-commit) c4d6d90] critical
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 critical
$ git ls-tree HEAD
100644 blob 82d423c32c4bb2c52938088e0234db041bf4eaaf    critical
$ git show 82d423c32c4bb2c52938088e0234db041bf4eaaf
very important info
$ echo 'Very important info' | git hash-object --stdin -w
81a3797afe76d339db25c0f9c705a6caa47279c2
$ mv .git/objects/81/a3797afe76d339db25c0f9c705a6caa47279c2 \
     .git/objects/82/d423c32c4bb2c52938088e0234db041bf4eaaf

Of course, git-fsck notices

$ git fsck
error: sha1 mismatch 82d423c32c4bb2c52938088e0234db041bf4eaaf

error: 82d423c32c4bb2c52938088e0234db041bf4eaaf: object corrupt or missing
missing blob 82d423c32c4bb2c52938088e0234db041bf4eaaf

but git-log is happy with the change

$ git log -p
commit c4d6d90467af9ffa94772795d5c5d191228933c1
Author: Greg Bacon <gbacon@dbresearch.net>
Date:   Thu Apr 7 12:20:53 2011 -0500

    critical

diff --git a/critical b/critical
new file mode 100644
index 0000000..82d423c
--- /dev/null
+++ b/critical
@@ -0,0 +1 @@
+Very important info

as is git-checkout.

$ rm critical 
$ git checkout .
$ cat critical 
Very important info

A specific invocation of git-show reveals the corruption

$ git show 82d423c32c4bb2c52938088e0234db041bf4eaaf
error: sha1 mismatch 82d423c32c4bb2c52938088e0234db041bf4eaaf

fatal: bad object 82d423c32c4bb2c52938088e0234db041bf4eaaf

but not a broader one.

$ git show
commit c4d6d90467af9ffa94772795d5c5d191228933c1
Author: Greg Bacon <gbacon@dbresearch.net>
Date:   Thu Apr 7 12:20:53 2011 -0500

    critical

diff --git a/critical b/critical
new file mode 100644
index 0000000..82d423c
--- /dev/null
+++ b/critical
@@ -0,0 +1 @@
+Very important info

Even git-clone doesn't notice!

$ cd ..
$ git clone repo clone
Cloning into clone...
done.
$ cat clone/critical 
Very important info

What is the full list of specific git command modes (e.g., git show $sha1 should be present but not git show or git show HEAD) that perform integrity checks?
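
For context, the mismatch that git-fsck reports can be reproduced by hand. The following is a minimal Python sketch, not Git's own code: it assumes the usual loose-object rule that a blob's name is the SHA-1 of the header `blob <size>\0` followed by the content. The helper name `git_blob_id` is made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object name Git assigns to a blob with this content."""
    # A blob is hashed as the header "blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = b"very important info\n"  # what `echo` wrote into the file
tampered = b"Very important info\n"  # the content swapped into its place

print(git_blob_id(original))
print(git_blob_id(tampered))
```

Run as-is, this should print the two IDs from the transcript above (82d423c... for the original content, 81a3797... for the swapped content), which is why the file moved into .git/objects/82/ no longer matches its name.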

Greg Bacon
  • I'm curious - is this just general curiosity, or are you trying to accomplish something specific? (Unfortunately, I have no idea without digging into the source about the actual answer.) – Cascabel Apr 07 '11 at 18:56
  • @Jefromi The motivation was a [discussion about backups](http://www.reddit.com/r/programming/comments/gk15g/never_trust_your_version_control_backups_why/) on r/programming. Someone [objected](http://www.reddit.com/r/programming/comments/gk15g/never_trust_your_version_control_backups_why/c1o4w1a) to every git clone being a restored backup with “But of course this is relying on Git to not have any flaws that could corrupt all the versions of a file.” I incorrectly assumed ordinary use would quickly warn about corruption, even for this pathological case. – Greg Bacon Apr 07 '11 at 22:17
  • I think the moral is: If you're using git as a backup system, and your main repository fails, you should do a `git fsck` after restoring it from the backup. Plus it would probably be good to send out an email saying "We've restored the central repo. `master` is at , and `release` is at " to make sure that everyone's clones agree. – Tyler Apr 08 '11 at 05:00
  • Git will immediately tell you about the usual mode of corruption: a flipped bit (or other change) in the actual on-disk loose object file or pack file. What you have demonstrated is that Git does not automatically give notice of complete replacement of one valid loose object with another valid loose object. Your scenario is interesting, but it seems like a less likely failure mode (though certainly possible if there is a systematic bug in something that you are using to copy/clone repositories). You may want to bring this up on the Git mailing list. – Chris Johnsen Apr 09 '11 at 04:15

3 Answers


Here's how I would go about finding this out, although I'm not going to go through each source file to work out the conditions under which the check is performed. :)

Clone git's source code:

git clone git://git.kernel.org/pub/scm/git/git.git

Check out the version you care about:

cd git
git checkout v1.7.1

Look for that error message:

git grep 'sha1 mismatch'

That leads you to object.c and the parse_object function. Now look for that function:

git grep parse_object

... and go through the 38 files checking the conditions under which that function will be called.
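
The last step can be partly automated. Below is a rough Python sketch, an illustration rather than a replacement for git grep or cscope: the hypothetical helper `files_calling` walks a checkout and lists the .c files that contain a call-like occurrence of a function name. It is a plain text match, so it also picks up the definition in object.c and any mention inside comments.

```python
import re
from pathlib import Path
from typing import List


def files_calling(root: str, function: str = "parse_object") -> List[str]:
    """List .c files under `root` containing `function(` as a whole word."""
    # \b keeps longer identifiers that merely end in the name from matching;
    # requiring "(" right after the name skips e.g. parse_object_buffer(.
    pattern = re.compile(r"\b" + re.escape(function) + r"\s*\(")
    hits = []
    for path in sorted(Path(root).rglob("*.c")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if pattern.search(text):
            hits.append(str(path.relative_to(root)))
    return hits
```

Calling `files_calling("git")` from the directory that contains the clone gives the candidate files to read through. Deciding which git commands actually reach each call site still has to be done by hand.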

Mark Longair

In response to Mark Longair's answer, I fired up cscope and found:

(Note that cscope has a curses interface and integrates nicely into Vim, in case your interest was piqued.)

Functions calling this function: parse_object

  File              Function                       Line
0 bundle.c          verify_bundle                   110 struct object *o = parse_object(e->sha1);
1 bundle.c          create_bundle                   242 struct object *object = parse_object(sha1);
2 bundle.c          create_bundle                   247 struct object *object = parse_object(sha1);
3 bundle.c          create_bundle                   323 obj = parse_object(sha1);
4 commit.c          lookup_commit_reference_gently   30 struct object *obj = deref_tag(parse_object(sha1), NULL, 0);
5 http-backend.c    show_text_ref                   372 struct object *o = parse_object(sha1);
6 http-push.c       one_remote_object               742 obj = parse_object(sha1);
7 http-push.c       add_remote_info_ref            1530 o = parse_object(ref->old_sha1);
8 log-tree.c        add_ref_decoration               93 struct object *obj = parse_object(sha1);
9 merge-recursive.c get_ref                        1664 object = deref_tag(parse_object(sha1), name, strlen(name));
a pack-refs.c       handle_one_ref                   43 struct object *o = parse_object(sha1);
b pretty.c          format_commit_one               835 parse_object(commit->object.sha1);
c reachable.c       add_one_reflog_ent              122 object = parse_object(osha1);
d reachable.c       add_one_reflog_ent              125 object = parse_object(nsha1);
e reachable.c       add_one_ref                     133 struct object *object = parse_object(sha1);
f reflog-walk.c     fake_reflog_parent              234 commit_info->commit = (struct commit *)parse_object(reflog->osha1);
g refs.c            peel_ref                        647 o = parse_object(base);
h refs.c            write_ref_sha1                 1452 o = parse_object(sha1);
i remote.c          ref_newer                      1482 o = deref_tag(parse_object(old_sha1), NULL, 0);
j remote.c          ref_newer                      1487 o = deref_tag(parse_object(new_sha1), NULL, 0);
k revision.c        add_head_to_pending             166 obj = parse_object(sha1);
l revision.c        get_reference                   176 object = parse_object(sha1);
m revision.c        handle_commit                   196 object = parse_object(tag->tagged->sha1);
n revision.c        handle_one_reflog_commit        855 struct object *o = parse_object(sha1);
o server-info.c     add_info_ref                     12 struct object *o = parse_object(sha1);
p sha1_name.c       peel_to_type                    508 o = parse_object(sha1);
q sha1_name.c       peel_to_type                    511 if (!o || (!o->parsed && !parse_object(o->sha1)))
r sha1_name.c       peel_onion                      573 o = parse_object(outer);
s sha1_name.c       peel_onion                      578 if (!o || (!o->parsed && !parse_object(o->sha1)))
t sha1_name.c       handle_one_ref                  698 struct object *object = parse_object(sha1);
u sha1_name.c       get_sha1_oneline                740 if (!parse_object(commit->object.sha1))
v tag.c             deref_tag                        16 o = parse_object(((struct tag *)o)->tagged->sha1);
w tree.c            parse_tree_indirect             271 struct object *obj = parse_object(sha1);
x tree.c            parse_tree_indirect             284 parse_object(obj->sha1);
y upload-pack.c     got_sha1                        342 o = parse_object(sha1);
z upload-pack.c     reachable                       382 parse_object(commit->object.sha1);
A upload-pack.c     receive_needs                   526 object = parse_object(sha1);
B upload-pack.c     send_ref                        644 struct object *o = parse_object(sha1);
C upload-pack.c     mark_our_ref                    670 struct object *o = parse_object(sha1);
D walker.c          loop                            182 parse_object(obj->sha1);
sehe
  • Cool, though it's hard to guess where some of those lower-level things are ultimately used. You'd have to do a lot more following to really figure it out. – Cascabel Apr 08 '11 at 05:39

Git 2.38 (Q3 2022) adds more on parse_object():

The server side that responds to "git fetch" and "git clone" request has been optimized by allowing it to send objects in its object store without recomputing and validating the object names.

See commit 945ed00 (07 Sep 2022), and commit 9a8c3c4, commit 0bc2557, commit c868d8e (06 Sep 2022) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 8b2f027, 13 Sep 2022)

parse_object(): allow skipping hash check

Signed-off-by: Jeff King

The parse_object() function checks the object hash of any object it parses.
This is a nice feature, as it means we may catch bit corruption during normal use, rather than waiting for specific fsck operations.

But it also can be slow.
It's particularly noticeable for blobs, where except for the hash check, we could return without loading the object contents at all.

Now one may wonder what is the point of calling parse_object() on a blob in the first place then, but usually it's not intentional: we were fed an oid from somewhere, don't know the type, and want an object struct.
For commits and trees, the parsing is usually helpful; we're about to look at the contents anyway.
But this is less true for blobs, where we may be collecting them as part of a reachability traversal, etc, and don't actually care what's in them.
And blobs, of course, tend to be larger.

We don't want to just throw out the hash-checks for blobs, though.
We do depend on them in some circumstances (e.g., rev-list --verify-objects uses parse_object() to check them).
It's only the callers that know how they're going to use the result.
And so we can help them by providing a special flag to skip the hash check.

We could just apply this to blobs, as they're going to be the main source of performance improvement.
But if a caller doesn't care about checking the hash, we might as well skip it for other object types, too.
Even though we can't avoid reading the object contents, we can still skip the actual hash computation.

If this seems like it is making Git a little bit less safe against corruption, it may be.
But it's part of a series of tradeoffs we're already making.
For instance, "rev-list --objects" does not open the contents of blobs it prints.
And when a commit graph is present, we skip opening most commits entirely.
The important thing will be to use this flag in cases where it's safe to skip the check.
For instance, when serving a pack for a fetch, we know the client will fully index the objects and do a connectivity check itself.
There's little to be gained from the server side re-hashing a blob itself.
And indeed, most of the time we don't! The revision machinery won't open up a blob reached by traversal, but only one requested directly with a "want" line.
So applied properly, this new feature shouldn't make anything less safe in practice.
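
To make the trade-off concrete, here is a toy Python model of the idea, under stated assumptions: it is not Git's implementation, the names `encode_object`, `toy_parse_object` and `skip_hash_check` are invented for this sketch, and the "object store" is just a dict of zlib-compressed loose objects. Unlike real Git, the toy always decompresses the object; it only shows what the hash check catches and what a caller gives up by skipping it.

```python
import hashlib
import zlib
from typing import Dict, Optional, Tuple


def encode_object(kind: str, payload: bytes) -> Tuple[str, bytes]:
    """Return (object name, zlib-compressed loose-object bytes)."""
    header = kind.encode("ascii") + b" " + str(len(payload)).encode("ascii") + b"\0"
    raw = header + payload
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


def toy_parse_object(store: Dict[str, bytes], name: str,
                     skip_hash_check: bool = False) -> Optional[Tuple[str, bytes]]:
    """Look up `name`; return (type, payload), or None if the hash check fails."""
    raw = zlib.decompress(store[name])
    if not skip_hash_check and hashlib.sha1(raw).hexdigest() != name:
        return None  # the point where Git would report a hash mismatch
    header, _, payload = raw.partition(b"\0")
    kind = header.split(b" ", 1)[0]
    return kind.decode("ascii"), payload


store: Dict[str, bytes] = {}
good_name, _ = encode_object("blob", b"very important info\n")
_, swapped_bytes = encode_object("blob", b"Very important info\n")
store[good_name] = swapped_bytes  # same swap as in the question: valid object, wrong name

print(toy_parse_object(store, good_name))                        # None
print(toy_parse_object(store, good_name, skip_hash_check=True))  # ('blob', b'Very important info\n')
```

With the check enabled, the swapped object is rejected. With it skipped, the wrong content is returned without complaint, which is acceptable only when something downstream (such as the fetching client re-indexing the pack) verifies the objects anyway.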

VonC