
I have always wondered how Git stores directories. Does Git follow the Linux philosophy that "everything is a file" and store a directory as just another file?

Acorn
Tumb1eweed

2 Answers


While AtnNn's answer is correct in terms of how the internal storage works, it's worth noting that Git builds these tree objects from the thing that Git calls its index or staging area or (rarely now) cache. The index is not capable of holding directories: it holds only files. The files in the index simply have long path names with embedded slashes, such as path/to/file.txt.
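
A quick way to see this is to stage a nested file in a throwaway repository and list the index. This is only a sketch: the repository location and the file contents are made up for the example.

```shell
# Throwaway repository; paths and contents are illustrative only.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir -p path/to
echo hello > path/to/file.txt
echo readme > README.md
git add .

# --stage prints mode, blob hash, stage number and full path for each entry.
git ls-files --stage

# Plain ls-files prints only the paths: files with slashes, no directories.
index_paths=$(git ls-files)
echo "$index_paths"
```

No entry for path or path/to appears on its own; the directory structure exists only inside the file's path name.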

The git write-tree command reads through the index and splits this up:

  • It creates a tree object that will contain an entry for a blob object, held under the component-name file.txt. This tree object will acquire a hash ID once it is created. Let's call this hash ID H2.
  • It creates another tree object that will contain an entry named to. The entry for to will store hash ID H2. (It may contain more entries: it will contain one for each other path that begins with path/to/.) When git write-tree writes out this tree object, it will obtain a hash ID; let's call this hash ID H1.
  • It then creates another tree object that will contain an entry named path, which will store hash ID H1. (As before, it may contain more entries, such as one named README.md that will hold the hash ID of the blob containing the README.md file's content.) When git write-tree writes out this tree object, it will obtain a hash ID, which we can call H0.

The git write-tree command reports this hash ID H0 to its standard output.

The git commit-tree command uses this hash ID, plus additional information, to create a commit object. The commit object will have H0 as its tree. Hence the commit will refer to tree H0.
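
The sequence above can be reproduced by hand. The following is a minimal sketch in a scratch repository; the variable names h0, h1 and h2 mirror the H0, H1 and H2 labels used here, and the identity values are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir -p path/to
echo hello > path/to/file.txt
echo readme > README.md
git add .

h0=$(git write-tree)                # top-level tree (H0)
h1=$(git rev-parse "$h0:path")      # tree for path/ (H1)
h2=$(git rev-parse "$h0:path/to")   # tree for path/to/ (H2)

git ls-tree "$h0"   # entries: README.md (blob) and path (tree)
git ls-tree "$h1"   # entry: to (tree)
git ls-tree "$h2"   # entry: file.txt (blob)

# commit-tree needs an identity; these values are placeholders.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
commit=$(git commit-tree "$h0" -m 'first commit')

# The new commit's tree is exactly the hash write-tree reported.
commit_tree=$(git rev-parse "$commit^{tree}")
echo "$commit_tree"
```

Note that commit-tree only creates the commit object; it does not update any branch, so a real workflow would follow it with something like git update-ref.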

To read the commit into Git's index, git read-tree notes that there is a sub-tree named path inside H0, so it reads that sub-tree (hash H1) and finds that there's an entry named to giving H2. It therefore reads that sub-sub-tree and finds the entry named file.txt giving the blob hash ID for the file. It then writes path/to/file.txt into the index, storing the hash ID for the blob object.

While git commit and git checkout now have all of these steps built into them, you can still use git write-tree followed by git commit-tree to make a new commit. You can still use git read-tree to read a tree into Git's index, and then use git checkout-index to extract the files into a work-area. The index has no directory names in it! It has only file names. The checkout code will just create new directories when needed: that is, if Git needs to create a file named path/to/file.txt and there is no path yet, Git will make it. Now that there is a path, Git will make path/to as well if needed, and now that path/to/ exists, Git can create a file named file.txt within path/to/.
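
Here is a small sketch of that round trip, again in a scratch repository with made-up contents. It empties the index and the work-tree, then rebuilds both from a tree hash.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir -p path/to
echo hello > path/to/file.txt
git add .
tree=$(git write-tree)

# Remove the work-tree copy and empty the index.
rm -rf path
git read-tree --empty

# Read the tree back: the index gets a single entry, path/to/file.txt.
git read-tree "$tree"
git ls-files

# Write every index entry to the work-tree; path/ and path/to/ are
# created on demand because the file needs them.
git checkout-index -a
restored=$(cat path/to/file.txt)
echo "$restored"
```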

The fact that Git doesn't store directories in the index means that:

  • you have no way to store permissions for directories [1]; and
  • there is no proper way to store an empty directory either.

There is a submodule trick that works for empty directories: see this answer to How can I add an empty directory to a Git repository?
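
The empty-directory limitation is easy to observe. In this sketch, empty-dir is a hypothetical directory name; git add accepts it silently, but nothing about it reaches the index.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir empty-dir
echo data > file.txt
git add .

# Only the file is staged; the empty directory leaves no trace.
staged=$(git ls-files)
echo "$staged"
```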


[1] Since the only allowed file modes today are 100755 (executable) and 100644 (non-executable), there's no place to store group-write permission anyway. In the early days of Git, you could store a file as mode 100664 for instance, so it would have made more sense then. Note that on Linux, directories must be executable to use them, so while tree objects are stored as mode 40000, the actual on-disk inode has mode 040777 & ~umask, where 040000 is the S_IFDIR bit. See, e.g., https://docs.huihoo.com/doxygen/linux/kernel/3.7/include_2uapi_2linux_2stat_8h.html
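
As a worked example of that mode arithmetic, assume a umask of 022 (a common default, chosen here only for illustration):

```shell
# S_IFDIR (040000) plus rwxrwxrwx (0777), with the umask bits cleared.
# The umask value 022 is an assumption for this example.
dir_mode=$(printf '%o' $(( 040777 & ~022 )))
echo "$dir_mode"
```

With that umask the result is 40755 in octal, i.e. a directory with rwxr-xr-x permissions.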

torek

Git stores directories as tree objects which contain, for each entry in the directory, the mode, type, hash and name of the entry. For example, in a Git repository with a file and a folder at the root:

$ ls
example.txt
src/

$ git cat-file -p HEAD:
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    example.txt
040000 tree 87a2294c8c0351121cefbaef16cbe88dd2b64b80    src

The cat-file command shows the pretty-printed (-p) version of the given object, HEAD:. The trailing colon followed by an empty path selects the root tree of the commit that HEAD points to. HEAD:src would refer to the src subfolder.

We can examine the raw directory data by passing tree instead of -p:

$ git cat-file tree HEAD: | hexdump  -C
00000000  31 30 30 36 34 34 20 65  78 61 6d 70 6c 65 2e 74  |100644 example.t|
00000010  78 74 00 e6 9d e2 9b b2  d1 d6 43 4b 8b 29 ae 77  |xt........CK.).w|
00000020  5a d8 c2 e4 8c 53 91 34  30 30 30 30 20 73 72 63  |Z....S.40000 src|
00000030  00 87 a2 29 4c 8c 03 51  12 1c ef ba ef 16 cb e8  |...)L..Q........|
00000040  8d d2 b6 4b 80                                    |...K.|
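
Each entry in the dump is laid out as the mode in ASCII, a space, the name, a NUL byte, and then the 20 raw bytes of the SHA-1 hash. As a sanity check, this sketch adds up the sizes of the two entries; entry_size is a helper invented for the example.

```shell
# entry_size MODE NAME: bytes taken by one raw tree entry, assuming a
# 20-byte SHA-1 hash (helper name is hypothetical).
entry_size() {
  # mode + space + name + NUL + 20 hash bytes
  echo $(( ${#1} + 1 + ${#2} + 1 + 20 ))
}

total=$(( $(entry_size 100644 example.txt) + $(entry_size 40000 src) ))
echo "$total"
```

The total is 69 bytes, which matches the length of the hexdump (0x45 bytes).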

If the object hasn't been packed, this tree object will be stored as a loose file under .git/objects. We can use rev-parse to find its hash:

$ git rev-parse HEAD:
cb8fd5fa2bf22ffa242d4e3fa520849551bbfa98

The file is zlib-compressed. Once decompressed, it holds the same data as above, prefixed with a small header: the object type, the size in bytes, and a NUL byte:

$ cat .git/objects/cb/8fd5fa2bf22ffa242d4e3fa520849551bbfa98 | zlib-flate -uncompress | hexdump -C
00000000  74 72 65 65 20 36 39 00  31 30 30 36 34 34 20 65  |tree 69.100644 e|
00000010  78 61 6d 70 6c 65 2e 74  78 74 00 e6 9d e2 9b b2  |xample.txt......|
00000020  d1 d6 43 4b 8b 29 ae 77  5a d8 c2 e4 8c 53 91 34  |..CK.).wZ....S.4|
00000030  30 30 30 30 20 73 72 63  00 87 a2 29 4c 8c 03 51  |0000 src...)L..Q|
00000040  12 1c ef ba ef 16 cb e8  8d d2 b6 4b 80           |...........K.|

And we can confirm that the hash is correct:

$ cat .git/objects/cb/8fd5fa2bf22ffa242d4e3fa520849551bbfa98 | zlib-flate -uncompress | sha1sum
cb8fd5fa2bf22ffa242d4e3fa520849551bbfa98  -
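
The same check works without touching .git/objects, and so also for packed objects, by rebuilding the header by hand. This sketch uses a scratch repository with made-up files and assumes the default SHA-1 object format; it will not produce the hash shown above, because the contents differ.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir src
: > example.txt
echo 'int main(void) { return 0; }' > src/main.c
git add .
tree=$(git write-tree)

# Size of the raw tree payload, as recorded in the object header.
size=$(git cat-file -s "$tree")

# "tree <size>" plus a NUL byte, followed by the raw entries, hashed with SHA-1.
manual=$( { printf 'tree %s\0' "$size"; git cat-file tree "$tree"; } | sha1sum | cut -d' ' -f1 )

# Let Git hash the same payload for comparison.
by_git=$(git cat-file tree "$tree" | git hash-object -t tree --stdin)

echo "$tree"
echo "$manual"
echo "$by_git"
```

All three lines should print the same hash.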

See the "Tree Objects" section of the documentation for more information.

Etienne Laurin