61

Context: I downloaded a file (Audirvana 0.7.1.zip) from code.google to my Macbook Pro (Mac OS X 10.6.6).

I wanted to verify the checksum, which for that particular file is posted as 862456662a11e2f386ff0b24fdabcb4f6c1c446a (SHA-1). git hash-object gave me a different hash, but openssl sha1 returned the expected 862456662a11e2f386ff0b24fdabcb4f6c1c446a.

The following experiment seems to rule out any possible download corruption or newline differences and to indicate that there are actually two different algorithms at play:

$ echo A > foo.txt
$ cat foo.txt
A
$ git hash-object foo.txt 
f70f10e4db19068f79bc43844b49f3eece45c4e8
$ openssl sha1 foo.txt 
SHA1(foo.txt)= 7d157d7c000ae27db146575c08ce30df893d3a64

What's going on?

svick
twcamper
  • There’s a good article on this at http://progit.org/book/ch9-2.html – Josh Lee Mar 13 '11 at 21:33
  • The location of the book changed: http://git-scm.com/book/ch9-2.html#Object-Storage (and I'm not able to edit comments on SO). – riezebosch Mar 21 '14 at 14:50
  • Possible duplicate of [Assigning Git SHA1's without Git](http://stackoverflow.com/questions/552659/assigning-git-sha1s-without-git) || http://stackoverflow.com/questions/7225313/how-does-git-compute-file-hashes – Ciro Santilli OurBigBook.com May 17 '16 at 09:25

5 Answers

81

You see a difference because git hash-object doesn't just take a hash of the bytes in the file: it prepends the string "blob " followed by the file size in bytes and a NUL byte to the file's contents before hashing. There are more details in another answer on Stack Overflow.

Or, to convince yourself, try something like:

$ echo -n hello | git hash-object --stdin
b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0

$ printf 'blob 5\0hello' > test.txt
$ openssl sha1 test.txt
SHA1(test.txt)= b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0
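
The same check can be sketched in Python using only the standard library. The helper name git_blob_hash is made up for this illustration; it just builds the "blob <size>\0" header described above and hashes the header plus the contents, assuming no filters are applied to the data.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Return the SHA-1 that git would assign to a blob holding these bytes."""
    # Header: object type, a space, the size in bytes as decimal, then a NUL.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    # Same input as `echo -n hello | git hash-object --stdin`
    print(git_blob_hash(b"hello"))
```

This should print the same b6fc4c62... value as the two commands above, while hashlib.sha1(b"hello") on its own gives a different digest.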
Mark Longair
  • Why have git authors chosen this behavior? – liori Mar 13 '11 at 15:59
  • liori: I can only speculate. I've added an answer showing how it is used in one special case, but I doubt that's the only reason. – araqnid Mar 13 '11 at 16:10
  • @liori: I guess it is to make sure that you don't have a blob that has the same object name (SHA1sum) as a commit or a tree, etc. - each has (at least) their type prepended before the hash is calculated. – Mark Longair Mar 13 '11 at 16:11
  • Also, the "blob \0" (or similar) at the start of the file means you can tell the type of the object very quickly just by decompressing the first bytes of the object file. There's more about the compression and what's actually written to disk in this section of the [nice chapter of Pro Git on Git Objects](http://progit.org/book/ch9-2.html#object_storage). – Mark Longair Mar 13 '11 at 16:17
  • @liori: it makes sense that git would use the sha-1 this way since its purpose is version control of file trees, which is not the purpose of cmd line utils like sha1sum or md5sum. – twcamper Mar 13 '11 at 16:42
  • @liori: All types of "objects" in git (blobs, commits, tags and trees), are named by a hash. There's a command `cat-file -t`, e.g `git cat-file -t a7bb6fb0` tells you the "type" of the object whose name (hash) starts with a7bb6fb0... It can do this because the actual object (stored in the repository, compressed) starts with "blob" or "tree" or whatever. You can see the object with a command like `python -c "import zlib,sys;print repr(zlib.decompress(sys.stdin.read()))" < .git/objects/a7/bb6fb0*`. Anyway the summary is that git's name is the hash of the git "object", not just the blob inside. – ShreevatsaR Aug 07 '13 at 04:54
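
The one-liner in the last comment is written for Python 2. A rough Python 3 equivalent might look like the sketch below; the function name read_loose_object is hypothetical, and it only handles loose object files under .git/objects, not objects stored in pack files.

```python
import zlib
from pathlib import Path


def read_loose_object(path):
    """Decompress a loose git object file and split it into (type, size, body)."""
    raw = zlib.decompress(Path(path).read_bytes())
    # The stored object is "<type> <size>\0<body>", the same bytes that get hashed.
    header, _, body = raw.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    return obj_type.decode("ascii"), int(size), body
```

Pass it the path of an object file inside .git/objects; the first element of the result is what `git cat-file -t` reports for that object.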
6

The SHA1 digest is calculated over a header string followed by the file data. The header consists of the object type, a space and the object length in bytes as decimal. This is separated from the data by a null byte.

So:

$ git hash-object foo.txt
f70f10e4db19068f79bc43844b49f3eece45c4e8
$ ( perl -e '$size = (-s shift); print "blob $size\x00"' foo.txt \
               && cat foo.txt ) | openssl sha1
f70f10e4db19068f79bc43844b49f3eece45c4e8

One consequence of this is that "the" empty tree and "the" empty blob have different IDs. That is:

e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 always means "empty file"
4b825dc642cb6eb9a060e54bf8d69288fbee4904 always means "empty directory"

You will find that you can in fact do git ls-tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904 in a new git repository with no objects registered, because it is recognised as a special case and never actually stored (with modern Git versions). By contrast, if you add an empty file to your repo, a blob "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391" will be stored.
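
Both IDs can be reproduced without git by hashing nothing but the header. This is a minimal sketch; git_object_hash is a name invented for the example, and it relies on an empty tree object having an empty body.

```python
import hashlib


def git_object_hash(obj_type: str, body: bytes) -> str:
    """Hash '<type> <size>\\0<body>' the way git names its objects."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    print(git_object_hash("blob", b""))  # the empty file
    print(git_object_hash("tree", b""))  # the empty directory
```

The two printed values should match the e69de29b... and 4b825dc6... IDs quoted above; they differ only because the type name in the header differs.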

araqnid
3

Git stores objects as [Object Type, space, Object Length, delimiter (\0), Content]. In your case:

$ echo "A" | git hash-object --stdin
f70f10e4db19068f79bc43844b49f3eece45c4e8

Try to calculate hash as:

$ echo -e "blob 2\0A" | shasum 
f70f10e4db19068f79bc43844b49f3eece45c4e8  -

Note the -e flag (needed in bash so that \0 is interpreted as a NUL byte), and that the length is 2 rather than 1 because echo appends a newline.

2

The answer lies here:

How to assign a Git SHA1's to a file without Git?

git computes the hash over a header (object type and size) plus the contents, not just the contents.

That is a good enough answer for now, and the takeaway is that git is not the tool for checksumming downloads.

twcamper
0

Watch out for filters!

Git actually filters the file before calculating the SHA-1. Typically, \r\n line endings are converted to \n, which is why git hash-object and git hash-object --no-filters can give different results. Other conversions may also be applied, and .gitattributes can affect the result.

A small example using Windows cmd:

Create test files in a new folder:

$ echo this is a test $Id$ > test1.txt
$ echo this is a test $Id: ffbf88668784c14e809c8c449d799b654d7a5fc5 $ > test2.txt

Now use git hash-object:

$ git hash-object test1.txt
0c3a75d8155d54c2367e290cf7f33434805410be

$ git hash-object test2.txt
60fff1b8ec47ed41254719681e32369d640d6a0f

$ git hash-object --no-filters test2.txt
2f68d9b80a38fb800f039ef9062c764d2a4d4352

Different files lead to different hashes, as expected, but git evidently filters the file, since --no-filters changes the result.

Now create a git repo and a .gitattributes file in the folder:

$ git init .
Initialized empty Git repository in ~/.git

$ echo *.txt ident > .gitattributes

$ git hash-object test1.txt
0c3a75d8155d54c2367e290cf7f33434805410be

$ git hash-object test2.txt
0c3a75d8155d54c2367e290cf7f33434805410be

$ git hash-object --no-filters test2.txt
2f68d9b80a38fb800f039ef9062c764d2a4d4352

Now test1 and test2 have the same hash, but the --no-filters option still gives the same value as before.

Conclusion: you can get the same hash with git and openssl (once the blob header is accounted for), but you need to make sure that your file is not affected by git filters.
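
To see why the two test files end up with the same hash, here is a rough Python sketch of the idea. The function approximate_clean is a hypothetical stand-in, not git's actual code: it assumes the only conversions in play are CRLF to LF and collapsing an expanded `$Id: ... $` keyword back to `$Id$`, and the exact bytes written by cmd's echo are assumed too.

```python
import hashlib
import re


def git_blob_hash(data: bytes) -> str:
    """SHA-1 over the 'blob <size>\\0' header plus the contents."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def approximate_clean(data: bytes) -> bytes:
    """Assumed approximation of two conversions git may apply before hashing."""
    data = data.replace(b"\r\n", b"\n")  # line-ending normalisation
    return re.sub(rb"\$Id:[^$\n]*\$", b"$Id$", data)  # undo ident expansion


if __name__ == "__main__":
    # Assumed file contents, modelled on the echo commands above.
    test1 = b"this is a test $Id$ \r\n"
    test2 = b"this is a test $Id: ffbf88668784c14e809c8c449d799b654d7a5fc5 $ \r\n"
    print(git_blob_hash(approximate_clean(test1)))
    print(git_blob_hash(approximate_clean(test2)))  # same as the line above
    print(git_blob_hash(test2))                     # unfiltered bytes: differs
```

Once both conversions are applied, the two inputs become byte-for-byte identical, so any hash over them must agree; hashing the unfiltered bytes corresponds to what --no-filters does.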

lmutricy