
According to this post, the hash of a file in Git is computed as follows:

    Blob Hash (SHA1) = SHA1("blob " + <size_of_file> + "\0" + <contents_of_file>)

I tested it myself for two empty files to check whether it was correct:

    100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       empty1.txt
    100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       empty2.txt
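
The formula can also be checked outside of Git. Below is a minimal sketch in Python, assuming the size is written as the decimal byte length of the contents; `git_blob_hash` is a name made up for this illustration, not a Git API.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Hash file contents the way the formula above describes (illustrative helper)."""
    # Header: the word "blob", a space, the byte length in decimal, then a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# An empty file: the file name never enters the computation.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Since only the size and the contents are hashed, any two empty files necessarily get the same ID, which matches the listing above.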

But why does Git exclude the name of the file from the hash? How does it distinguish between empty1.txt and empty2.txt?

If I were to change the name of empty1.txt to empty2.txt, how does Git keep track of that change when I call git status?

Sentient
  • Git manages a working directory via a tree structure. In that tree structure, file names are mapped to SHA-1 blobs, in a dictionary fashion. So the tree maintains the relationship between file name and SHA-1, not the blob itself. – Tim Biegeleisen Nov 24 '17 at 01:47
  • Is this tree structure an explicit object (e.g. a HashMap)? And is this tree structure the same one from the structs of Git -- commits, trees, blobs, tags? – Sentient Nov 24 '17 at 01:52
  • I don't know what you mean by "explicit" object, but yes, it is stored somewhere. Yes, it is one of the four objects Git uses. – Tim Biegeleisen Nov 24 '17 at 01:55
  • Ah that makes sense -- so when status is called, Git just makes comparisons between the tree of a commit and the tree of its parent, right? – Sentient Nov 24 '17 at 02:00
  • I'm not certain of that...my knowledge of Git implementation is fairly limited. I just wanted to point out that the blob SHA-1 is not the only thing which is used to identify a file, these blobs are part of a tree structure. – Tim Biegeleisen Nov 24 '17 at 02:02
  • Thanks Tim! I'm just trying to understand how the staging works: https://stackoverflow.com/questions/15765366/how-does-git-track-file-changes-internally – Sentient Nov 24 '17 at 02:05
  • Maybe you should start with [the documentation](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects). – larsks Nov 24 '17 at 02:41

1 Answer


But why does Git exclude the name of the file from the hash? How does it distinguish between empty1.txt and empty2.txt?

Because Git manages content, not file names: if the content of two files is identical, their blob SHA1 is identical too.

The file names are managed by the tree object (the directory content), which lists the entries of a given folder together with their mode, type, and SHA1.

https://git-scm.com/book/en/v2/images/data-model-2.png

    $ git cat-file -p 3c4e9cd789d88d8d89c1073707c3585e41b0e614
    040000 tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579      bak
    100644 blob fa49b077972391ad58037050f2a75f74e3671e92      new.txt
    100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a      test.txt
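
To illustrate how the name ends up in the tree rather than in the blob, here is a rough Python sketch of hashing a tree object. It is simplified on purpose: it assumes a flat directory of regular files with mode 100644 and ASCII names, and `blob_id` and `tree_id` are hypothetical helper names, not Git commands or APIs.

```python
import hashlib
from typing import Dict


def blob_id(data: bytes) -> str:
    """SHA-1 of a blob: header 'blob <size>\\0' followed by the contents."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def tree_id(files: Dict[str, bytes]) -> str:
    """SHA-1 of a tree holding only regular files (simplifying assumption)."""
    body = b""
    # Entries are ordered by name; each is '<mode> <name>\0' plus the raw 20-byte blob ID.
    for name in sorted(files, key=lambda n: n.encode("utf-8")):
        raw_sha = bytes.fromhex(blob_id(files[name]))
        body += b"100644 " + name.encode("utf-8") + b"\0" + raw_sha
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Same blob, different names: the blobs are identical, the trees are not.
print(blob_id(b""))
print(tree_id({"empty1.txt": b""}))
print(tree_id({"empty2.txt": b""}))
```

Renaming empty1.txt to empty2.txt therefore leaves the blob untouched and only changes the tree that references it.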
Tim Skov Jacobsen
VonC
  • So if I understand correctly, Git creates a tree with objects that are listed within the index. Is the type of an object (tree, blob) stored as well within that file along with its SHA-1? And then, how does Git use this index / tree to detect which files have been modified, untracked, or deleted? – Sentient Nov 24 '17 at 09:51
  • Is index a tree itself that hasn't been associated with a commit? When I open the index of my repository, I'm noticing it has been serialized. – Sentient Nov 24 '17 at 10:02
  • @SlackOverflow The type is part of the SHA1: https://stackoverflow.com/a/21361195/6309 – VonC Nov 24 '17 at 10:16
  • Edit: http://alblue.bandlem.com/2011/10/git-tip-of-week-understanding-index.html seems to suggest that is not the case. Still, I wonder how index makes fast comparisons without being an object. – Sentient Nov 24 '17 at 10:16
  • @SlackOverflow git status has improved recently: https://stackoverflow.com/a/43667992/6309 – VonC Nov 24 '17 at 10:26
  • Ah that was what I was thinking: index compares the modification time of each file, and if it differs, only then will it compute the SHA-1 hash of a file -- solely guessing off the fact that index keeps track of time metadata. – Sentient Nov 24 '17 at 10:40
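
The idea in the last comment (check cheap file metadata first, hash only when it differs) can be sketched roughly as follows in Python. This is an illustration of the general technique, not Git's actual index code: `CachedEntry`, `snapshot`, and `is_modified` are hypothetical names, and only the modification time and size are compared here.

```python
import hashlib
import os
from typing import NamedTuple


class CachedEntry(NamedTuple):
    """What this sketch remembers about a file (a stand-in for an index entry)."""
    mtime_ns: int
    size: int
    blob_id: str


def blob_id(data: bytes) -> str:
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def snapshot(path: str) -> CachedEntry:
    """Record the stat data and content hash of a file."""
    st = os.stat(path)
    with open(path, "rb") as f:
        return CachedEntry(st.st_mtime_ns, st.st_size, blob_id(f.read()))


def is_modified(path: str, cached: CachedEntry) -> bool:
    """Cheap stat comparison first; re-hash the contents only if it differs."""
    st = os.stat(path)
    if (st.st_mtime_ns, st.st_size) == (cached.mtime_ns, cached.size):
        return False  # metadata unchanged: assume the content is unchanged too
    with open(path, "rb") as f:
        return blob_id(f.read()) != cached.blob_id
```

The design trade-off is that most files are dismissed with a single `stat` call, and the more expensive SHA-1 computation is reserved for files whose metadata has actually changed.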