
Git stores files as blobs, and then uses a SHA-1 checksum as a key to find each specific blob amongst the others, similar to a filename identifying a file.

So how does this dark magic work? That is, how does one start with a text file and end up with a blob? Is a blob created by dereferencing the memory address of the file or something?

Keenan Diggs
  • [Git Plumbing and Porcelain](https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain), a must read if you really want to learn more about this. – Tim Biegeleisen Nov 25 '19 at 16:21
  • "Blobs" in git are files. One starts with a text file and finishes with a binary file. No magic. – phd Nov 25 '19 at 16:22
  • If you are using an operating system that doesn't distinguish between text files and binary files (like Unix), it's not even that complicated. You start with a file, Git hashes it, and uses the resulting hash to locate a copy of the file in the data base. – chepner Nov 25 '19 at 17:22
  • git-from-the-inside-out could answer your question very well: https://codewords.recurse.com/issues/two/git-from-the-inside-out – Philippe Nov 25 '19 at 23:23

1 Answer


There's very little actual magic in Git. The one part that is pretty magical is the combination of the various Secure Hash Algorithm (SHA) checksum designs, Git's use of these checksums, and how they form a Merkle tree, but this is more "math magic" than anything else.

I think you're really asking "how does Git come up with the hash ID", and the answer to that one is simple:

  • Find the size of the file, in bytes. Print this in decimal, e.g., 123.
  • Put the printed size after the word blob and a space, then append an ASCII NUL character (b'\0' in Python, for instance). This is the prefix.
  • Hash the prefix followed by the data. The result is the blob's hash ID:

    $ python3
    ...
    >>> data = b"some file data\n"
    >>> prefix = "blob {}\0".format(len(data)).encode("utf-8")
    >>> import hashlib
    >>> h = hashlib.sha1()
    >>> h.update(prefix)
    >>> h.update(data)
    >>> h.hexdigest()
    'a831035a26dd2f75c2dd622c70ee22a10ee74a65'
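The interactive session above hashes an in-memory bytes literal. For a file on disk the recipe is the same. Here is a minimal sketch, assuming we hash the raw bytes exactly as stored (no line-ending conversion or other filters). The name blob_id_for_file and the chunk size are my own choices for illustration, not anything Git defines:

```python
import hashlib
import os
import tempfile


def blob_id_for_file(path, chunk_size=65536):
    """Return the SHA-1 blob ID for the raw bytes of the file at `path`."""
    # The size goes into the prefix, so it must be known before hashing.
    size = os.path.getsize(path)
    h = hashlib.sha1()
    h.update(b"blob %d\0" % size)
    with open(path, "rb") as f:
        # Feed the file to the hasher in pieces instead of reading it whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Write the same 15 bytes used in the session above to a scratch file.
with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "example.txt")
    with open(path, "wb") as f:
        f.write(b"some file data\n")
    print(blob_id_for_file(path))
```

This should print the same ID as the interactive session, since the bytes being hashed are identical. Reading in chunks just avoids loading a large file into memory all at once.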
    

We can check by using Git's object hasher:

    $ echo 'some file data' | git hash-object -t blob --stdin
    a831035a26dd2f75c2dd622c70ee22a10ee74a65

The hashes match, so this is the blob hash for any file that consists solely of the 15-byte line "some file data" as terminated by a newline. Note that it is the content that determines the hash ID: the file's name here is irrelevant. (This means the file's name must be, and is, stored elsewhere: in Git, in one or more tree objects.)
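To make the "name lives in the tree" point concrete, here is a rough sketch that hashes one blob and then two single-entry trees that refer to it under different names. It assumes the SHA-1 object format, where a tree body is a sequence of entries of the form mode, space, name, NUL, then the 20 raw bytes of the ID. The helpers object_id and tree_with_one_file are hypothetical names for illustration, and real trees also need their entries sorted, which a one-entry tree sidesteps:

```python
import hashlib


def object_id(kind, body):
    """Hash `body` behind a "<kind> <size>\\0" header (SHA-1 object format)."""
    header = b"%s %d\0" % (kind.encode("ascii"), len(body))
    return hashlib.sha1(header + body).hexdigest()


def tree_with_one_file(name, blob_hex, mode="100644"):
    """Build the body of a tree holding a single regular-file entry."""
    entry = mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
    # The ID is stored as 20 raw bytes, not as 40 hex characters.
    return entry + bytes.fromhex(blob_hex)


data = b"some file data\n"
blob = object_id("blob", data)  # no file name is involved here

# Same blob, two different names: the trees differ, the blob does not.
tree_a = object_id("tree", tree_with_one_file("a.txt", blob))
tree_b = object_id("tree", tree_with_one_file("b.txt", blob))

print("blob:  ", blob)
print("tree a:", tree_a)
print("tree b:", tree_b)
```

Because each tree's hash covers the blob IDs it contains, changing a file's content changes the blob ID, which changes the tree ID, and so on upward. That is the Merkle tree property mentioned earlier.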

(Note that SHA-1 is no longer considered cryptographically secure. Git is slowly being migrated to other hash algorithms, but there is no rush here. See "How does the newly found SHA-1 collision affect Git?")
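For a feel of what a different hash algorithm changes, here is a small sketch that keeps the recipe and swaps only the hash function. It assumes the prefix stays "blob", a space, the size, and a NUL, which is my understanding of how Git's SHA-256 object format works; treat that as an assumption and check it against a repository created with that format. The function name blob_id and its algo parameter are hypothetical:

```python
import hashlib


def blob_id(data, algo="sha1"):
    """Hash `data` as a blob using the named hashlib algorithm."""
    h = hashlib.new(algo)
    h.update(b"blob %d\0" % len(data))
    h.update(data)
    return h.hexdigest()


data = b"some file data\n"
print(blob_id(data, "sha1"))    # 40 hex digits
print(blob_id(data, "sha256"))  # 64 hex digits
```

The IDs get longer, but the idea is unchanged: the ID is still a function of the content and nothing else.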

torek