Why is the git index file binary?

Question

Most of the files in my Git directory are plain text files (except for the compressed loose objects and the packfiles). So I can just cat and edit files like .git/HEAD or .git/refs/heads/master and inspect the repository if it gets corrupted.

But the .git/index is a binary file. Wouldn't a plain text file be more useful because it can easily be modified by hand?

Scott Chacon shows the following image in his presentation (Slide 278): [image: Index by Scott Chacon]

In my opinion, this could easily be stored in a plain text file.

So why is it a binary file rather than a plain text file?

The answers in http://stackoverflow.com/q/4084921/6309 can help. — VonC, Dec 02 '14 at 09:50
@VonC I can just see an explanation about the structure of the binary file. Am I missing something? — das_j, Dec 02 '14 at 09:51
"So why is it a binary file rather than a plain text file?": the answers show how the structure of the index is binary. — VonC, Dec 02 '14 at 09:56
@VonC But it just stores three hashes per file, the modification time, and the filename. Does this really need to be indexed? — das_j, Dec 02 '14 at 10:10
Yes, for performance reasons. It works with index entries (https://github.com/git/git/blob/867b1c1bf68363bcfd17667d6d4b9031fa6a1300/Documentation/technical/index-format.txt#L38) and cached trees (https://github.com/git/git/blob/867b1c1bf68363bcfd17667d6d4b9031fa6a1300/Documentation/technical/index-format.txt#L132-L138): the cached trees help speed up tree object generation from the index for a new commit. — VonC, Dec 02 '14 at 10:13

Jazimov · Answer 1 · 2017-05-01T03:55:55.497

None of the reasons given by the answer adequately addresses the question posed, which is "Why is the Git index file binary?". The accepted answer is just not correct. The index doesn't "contain" any plain-text files--it contains references to plain-text files. Furthermore, to say that the Git index contains "index entries" says nothing really useful, especially to a fellow developer seeking Truth... Finally, trees are not cached by the index--references to the trees are cached.

The index isn't binary because it's "indexed" (as the poster concluded in a comment above)--and it isn't binary for "performance reasons", per se. Everything in the index could be expressed using a pure text file--even the flags and bits stored in the binary index file could be written out as ASCII. It's binary because binary file formats that contain bit-wise flags are able to use disk space more efficiently. And, knowing Linus, it is probably binary so as to dissuade tampering by newbies with easy access to text editors.
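To put a rough number on the disk-space point, here is a minimal sketch that encodes the same two fields once as packed binary and once as ASCII. The field values and the text layout are made up for illustration; this is not how Git lays out a real index entry.

```python
import struct

def encode_binary(sha1: bytes, flags: int) -> bytes:
    """Pack a 20-byte hash followed by a 16-bit big-endian flags field."""
    return sha1 + struct.pack(">H", flags)

def encode_text(sha1: bytes, flags: int) -> bytes:
    """Write the same data as ASCII: hex hash, a space, 16 flag bits, newline."""
    return f"{sha1.hex()} {flags:016b}\n".encode("ascii")

# Hypothetical values, chosen only to have something to measure.
sha1 = bytes(range(20))
flags = 0x8005

print(len(encode_binary(sha1, flags)), "bytes as binary")
print(len(encode_text(sha1, flags)), "bytes as text")
```

In this toy layout the text form is more than twice the size of the packed form, and that gap repeats for every entry in the file.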

* New information * Version 4 of the index implements path compression, saving up to roughly 50% on the size of the index for large repos. (Source: https://git-scm.com/docs/git-update-index) This compression would lend itself to a binary-format index file.
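As a rough illustration of what path compression means here, the sketch below prefix-compresses a sorted list of paths: each path is stored as the number of trailing characters to drop from the previous path plus the suffix to append. This is a simplification under my own assumptions. The real version 4 format works on bytes and uses a variable-width integer and a NUL terminator, and the function names are hypothetical.

```python
def compress_paths(paths):
    """Store each path as (chars to drop from the previous path, suffix to append)."""
    entries = []
    prev = ""
    for path in paths:
        # Length of the prefix shared with the previous path.
        common = 0
        limit = min(len(prev), len(path))
        while common < limit and prev[common] == path[common]:
            common += 1
        entries.append((len(prev) - common, path[common:]))
        prev = path
    return entries

def decompress_paths(entries):
    """Rebuild the full paths from the (drop, suffix) pairs."""
    paths = []
    prev = ""
    for drop, suffix in entries:
        path = prev[:len(prev) - drop] + suffix
        paths.append(path)
        prev = path
    return paths

paths = ["src/main.c", "src/main.h", "src/util/io.c"]
for entry in compress_paths(paths):
    print(entry)
```

Because index entries are sorted by path, neighbouring entries tend to share long prefixes, which is where the savings come from.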

Interesting. +1. I tried to amend my answer to make it a bit less incorrect or nonsensical. — VonC, Apr 28 '17 at 07:18

score 3 · Accepted Answer · edited May 23 '17 at 12:17

The index, as presented in "What does the git index contain EXACTLY?", contains metadata and, as noted by Jazimov, references:

index entries: references to entries, with metadata (time, mode, size, SHA1, ...)
cached trees: references to trees ("pre-computed hashes for trees that can be derived from the index"), which help speed up tree object generation from the index for a new commit.
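For anyone who wants to look inside the file anyway, here is a minimal sketch that reads the 12-byte header described in index-format.txt: the signature "DIRC", a 4-byte version number, and a 32-bit entry count. The function name is hypothetical, and the sample bytes are built by hand rather than read from a real repository.

```python
import struct

def parse_index_header(data: bytes):
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    if len(data) < 12:
        raise ValueError("index header is 12 bytes long")
    # All three header fields are big-endian (network byte order).
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, entry_count

# Hand-built header for illustration: version 2, three entries.
sample_header = struct.pack(">4sII", b"DIRC", 2, 3)
print(parse_index_header(sample_header))
```

To try it on a real repository, pass it the bytes read from .git/index opened in "rb" mode. For a plain-text view of the entries without any parsing, git ls-files --stage prints the mode, object name, stage and path of each one.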

The concatenation of that data makes it a binary file, although the actual reason is pure speculation. Not being able to modify it by hand could be one.
