Why does Git merge the index file instead of completely overriding it when checking out a commit

Question

This is a question about Git internals. I use low-level commands here and don't use branches.

Setup

echo f1 > f1.txt
echo f1 > f1.txt
git add .
git commit -m "first"
...cf178d5

Now I want to create a new commit containing a single file, f3.txt, using the index and the write-tree command:

$ rm f1.txt f2.txt
$ echo ‘f3 content’ > f3.txt
$ git add .

So currently the index file and the directory contain only the new f3.txt file:

$ git ls-files -s
100644 [some hash] 0       f3.txt

$ ls
f3.txt

This is what causes the weird behavior later.

So I write the tree to the repository and update HEAD with the resulting hash:

LATEST_TREE_HASH=$( git write-tree )
echo $LATEST_TREE_HASH > .git/HEAD

If I now run git status I get:

$ git status
Not currently on any branch.
nothing to commit, working directory clean

Question

If I now check out the first commit, which has two files, f1.txt and f2.txt:

$ git checkout cf178d5
A       f3.txt                    <--------------- why?
HEAD is now at a27a75a... initial commit

Git works fine, but I believe it merges trees in the index instead of overriding it. The git checkout output shows that it treats f3.txt as an added file, and the index contents confirm this:

$ git ls-files -s
100644 [some hash] 0       f1.txt
100644 [some hash] 0       f2.txt
100644 [some hash] 0       f3.txt

$ ls 
f1.txt f2.txt f3.txt

It shows three files. What is the reason for this behavior?

By overriding, the staged changes are lost. By merging, the user can choose to commit the changes or discard them. Losing unsaved changes is a disaster. Merging conflict in the index will always abort the checkout. — ElpieKay, Aug 23 '17 at 14:36
@ElpieKay, thanks, so what are you saying, it's not merging, it's something else here? — Max Koretskyi, Aug 23 '17 at 14:45
After *"Now I want to create a new commit..."* you run `rm f1.txt f2.txt` and several lines below, without doing any checkout, `ls` shows the files are still in the working tree. Are you sure you posted the correct code? — axiac, Aug 23 '17 at 14:56
Also, `git commit -m "second commit"` is, in fact, the first commit. It contains only `f3.txt` because the other two files were removed from the working tree by `rm f1.txt f2.txt` (they were still in the index at that time) and then from the index by `git add .` — axiac, Aug 23 '17 at 14:59
@axiac, I managed to simplify the setup. The problem is definitely with the manual update of HEAD. Can you take a look now, please? — Max Koretskyi, Aug 23 '17 at 15:49

torek · Accepted Answer · 2017-08-23T17:18:41.270

Edit: the question has changed enough to invalidate the previous response.

There's still a typo (f1.txt listed twice) and funky non-ASCII Unicode quote marks, but we can now see what is going wrong here:

$ LATEST_TREE_HASH=$( git write-tree )
$ echo $LATEST_TREE_HASH > .git/HEAD

This is a bit of a problem. As Mark Adelsberger noted in a comment and your script says by using the word TREE here, git write-tree writes a tree, not a commit.
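For contrast, here is a minimal sketch of wrapping the tree in a commit object before touching HEAD. It assumes a throwaway repository; the variable names and the identity values are made up for illustration. It uses git commit-tree to create the commit and git update-ref --no-deref to write the hash into HEAD instead of echoing into .git/HEAD by hand.

```shell
# Throwaway repository; the identity values are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo f3 > f3.txt
git add f3.txt

# write-tree stores the index as a tree object and prints its hash.
tree=$(git write-tree)

# commit-tree wraps that tree in a commit object (no parent here).
commit=$(echo 'second' | git commit-tree "$tree")

tree_type=$(git cat-file -t "$tree")
commit_type=$(git cat-file -t "$commit")
echo "tree object type:   $tree_type"
echo "commit object type: $commit_type"

# Point HEAD directly at the commit (detached HEAD), not at the tree.
git update-ref --no-deref HEAD "$commit"
```

After this, HEAD resolves to a commit whose tree is the one write-tree produced, which is the state git checkout expects to start from.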

Why this is a problem

What's in .git/HEAD is supposed to be exactly one of two things:

a string of the form ref: refs/heads/name, where name is a valid branch name, or
the hash ID of a commit object.

In turn, a branch name—a reference of the form refs/heads/name—must always point to a commit object, never to a blob, tree, or tag object.
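The two legal forms can be observed directly. This is a small sketch in a scratch repository, assuming the default files-based ref storage; the variable names and identity are hypothetical, and the initial branch name depends on your Git configuration, so it is not hard-coded.

```shell
# Scratch repository; identity values are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Form 1: a symbolic ref naming a branch (which may not exist yet).
head_symbolic=$(cat .git/HEAD)
echo "fresh repository: $head_symbolic"

echo f1 > f1.txt
git add f1.txt
git commit -q -m first

# Form 2: a raw hash, produced here by detaching HEAD.
git checkout -q --detach
head_detached=$(cat .git/HEAD)
head_type=$(git cat-file -t "$head_detached")
echo "detached HEAD:    $head_detached ($head_type)"
```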

This means that Git in general assumes that whatever comes out of .git/HEAD, it refers to a commit object. By writing this tree hash into .git/HEAD you've violated this assumption. However, to allow for "unborn branches", such as the state of an initial repository with no master yet, HEAD can contain the name of a branch that does not actually exist.

What happens next is, I think, not guaranteed. The git checkout command assumes that if HEAD contains a valid hash, it contains a commit hash, and the only other allowed possibility is that HEAD contains the name of an orphan branch. So we run git checkout target_hash, as in your example:

git checkout cf178d5

Case 1: moving from commit to commit

Suppose HEAD contained a valid commit hash. Let's call this the old hash, as distinguished from the target commit hash. In this case, git checkout would:

Compare (recursively as needed for sub-trees) the contents of the tree of old to the contents of the tree of target.¹
For each path whose hash must change (including paths being added or removed), check whether the versions in the current index and work-tree match those in old.
If all match, update the index hashes and copy the new files to the work-tree (or remove the files from the work-tree and remove the index entry, if appropriate).
Otherwise (some files don't match): complain and refuse the checkout.

Obviously --force disables the check, but this is the basic process by which both staged and unstaged modifications are carried from one checkout to another when switching branches without being in a "clean" state. The process is described in all its gory detail in the Two Tree Merge section of the git read-tree documentation.
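As an illustration of case 1, the sketch below stages a file that neither commit knows about and then switches commits. The repository, file names, and variables are invented for the example. Because the path exists in neither the old nor the target tree, the checkout leaves the staged entry alone.

```shell
# Scratch repository; identity values are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo f1 > f1.txt
git add f1.txt
git commit -q -m first
first=$(git rev-parse HEAD)

echo f2 > f2.txt
git add f2.txt
git commit -q -m second

# Stage a path that is in neither commit's tree.
echo f3 > f3.txt
git add f3.txt

# Move from the second commit (old) to the first (target).
git checkout -q "$first"

staged=$(git diff --cached --name-status)
index_files=$(git ls-files)
echo "staged relative to HEAD: $staged"
echo "index now lists:"
echo "$index_files"
```

Here f2.txt differs between old and target and was unmodified, so it is removed; f3.txt is untouched by the comparison and stays staged as an addition.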

Case 2: moving from orphan branch to commit

The other possibility allowed by the rules is that you are currently on an orphan branch. In this case, there is no current commit. Most likely, Git simply uses the empty tree as if it were the current commit. It then follows the same rules as for case 1, which is now possible since it has a tree.

But this is, obviously, not guaranteed. If Git were to use the current (valid) tree stored in .git/HEAD as the base tree, instead of the empty tree, and then proceed as for case 1, you'd see your f3.txt file get removed. Follow all the sub-cases outlined in git read-tree with $H set to your existing tree, vs $H set to the empty tree. (I admit to not having done so, but I think this is where the behavior comes from. But see also the remark about case 3 in the read-tree documentation!)
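The two choices of $H can be compared with the plumbing command itself. What follows is only a sketch: it runs git read-tree -m -i on scratch copies of the index (selected through GIT_INDEX_FILE) and does not claim to reproduce what git checkout does internally. The variable names and the scratch index file names are made up.

```shell
# Scratch repository mirroring the question's setup; identity is a placeholder.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo f1 > f1.txt
echo f2 > f2.txt
git add .
git commit -q -m first
target=$(git rev-parse HEAD)

rm f1.txt f2.txt
echo f3 > f3.txt
git add .                      # the index now holds only f3.txt

current_tree=$(git write-tree)
empty_tree=$(git hash-object -w -t tree /dev/null)

# Work on copies so the real index is left untouched.
cp .git/index "$repo/.git/index.empty-base"
cp .git/index "$repo/.git/index.tree-base"

# Two-tree merge with $H = empty tree, $M = target commit.
GIT_INDEX_FILE="$repo/.git/index.empty-base" \
    git read-tree -m -i "$empty_tree" "$target"

# Two-tree merge with $H = the tree that was written into HEAD.
GIT_INDEX_FILE="$repo/.git/index.tree-base" \
    git read-tree -m -i "$current_tree" "$target"

with_empty_base=$(GIT_INDEX_FILE="$repo/.git/index.empty-base" git ls-files)
with_tree_base=$(GIT_INDEX_FILE="$repo/.git/index.tree-base" git ls-files)

echo "H = empty tree:"
echo "$with_empty_base"
echo "H = current tree:"
echo "$with_tree_base"
```

With the empty tree as $H, f3.txt is in neither tree, so the index entry is kept and the result lists all three files, matching the output in the question. With the written tree as $H, f3.txt matches $H and is absent from $M, so it is dropped.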

¹Git actually achieves this using a temporary index, stored in the index.lock file. If all goes well, the temporary index is renamed to become the regular index, unlocking the index in the process. If things go poorly, Git removes the temporary index.lock file, discarding the temporary index and unlocking the index.
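The locking side of this can be seen by creating the lock file by hand. This is a sketch for a scratch repository only (never do it in a repository where another Git process might be running); the variable names are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo f1 > f1.txt

# Simulate another process holding the index lock.
: > .git/index.lock

if git add f1.txt 2>/dev/null; then
    add_result=succeeded
else
    add_result=refused
fi
echo "git add while locked: $add_result"

# Releasing the lock lets index updates proceed again.
rm .git/index.lock
git add f1.txt
staged=$(git ls-files)
echo "index after unlock: $staged"
```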

Original answer (to somewhat different question)

There's another set of funky non-ASCII quote marks that made cut and paste of your instructions fail, so that when I made the normal first commit I ended up with two files:

$ git commit -m "second commit"
[master (root-commit) b9c7e4b] second commit
 2 files changed, 1 insertion(+)
 create mode 100644 f3.txt
 create mode 100644 f3.txtcontentecho
$ ls
f3.txt                  f3.txtcontentecho
$ git ls-files -s
100644 5927d85c2470d49403f56ce27afd8f74b1a42589 0       f3.txt
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       f3.txtcontentecho

But note that my ls vs ls-files -s output differs enormously from yours at this point:

So currently the index file and the directory contains only new f3.txt file:
$ git ls-files -s
100644 [some hash] 0       f3.txt

$ ls
f1.txt f2.txt

It's not at all clear to me why you would have files f1.txt and f2.txt in your work-tree now; I don't.

Now we create a commit with git commit-tree and run git checkout:

$ INITIAL_COMMIT_HASH=$( \
>     echo 'initial commit' | git commit-tree $INITIAL_TREE_HASH )
$ git checkout $INITIAL_COMMIT_HASH

but what I get is very different:

Note: checking out 'cd1bc16160c8a2814cd94bc8397230ffe5a16c22'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at cd1bc16... initial commit

and everything reads as I would expect (files f1.txt and f2.txt are in the work-tree and the index; neither of the f3 files is visible). Running git log --graph --all shows the expected two (disconnected) commits (both are root commits, with no parents).
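A compact way to check for disconnected root commits is to count commits with no parents. The sketch below builds two parentless commits with git commit-tree in a scratch repository; the branch names a and b, the variables, and the identity are invented for the example.

```shell
# Scratch repository; identity values are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo f1 > f1.txt
git add f1.txt
first=$(echo 'first' | git commit-tree "$(git write-tree)")

# Start over with an empty index so the second tree is unrelated.
git read-tree --empty
echo f3 > f3.txt
git add f3.txt
second=$(echo 'second' | git commit-tree "$(git write-tree)")

# Give both commits a name so that --all can find them.
git update-ref refs/heads/a "$first"
git update-ref refs/heads/b "$second"

roots=$(git rev-list --count --max-parents=0 --all)
echo "root commits: $roots"
git log --graph --all --oneline
```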

thanks, _It's not at all clear to me why you would have files f1.txt and f2.txt in your work-tree now; I don't._ - yeah, that's a typo, I get `f3.txt`. Let me go through the code again, maybe I missed something, I was copying it — Max Koretskyi, Aug 23 '17 at 15:10
I managed to simplify the setup. The problem is definitely with manual update of the `HEAD`. Can you take a look please now? — Max Koretskyi, Aug 23 '17 at 15:49
@MaximKoretskyi - `git write-tree` creates a `TREE`, not a `COMMIT`. After you remove the first two files and create the 3rd file, you aren't committing anything (according to the steps you've written in the current version of your question); so the contents of the index are "changes staged for commit", the fact that you have a random tree object in the database notwithstanding. — Mark Adelsberger, Aug 23 '17 at 16:04
@MarkAdelsberger, thanks, I know that `write-tree` creates a tree, not a commit. Actually if I run `git status` after I update the `HEAD`, I can see the following `Not currently on any branch. nothing to commit, working directory clean`. I added these details to the question. So git doesn't report any changes staged for commit. This is because git compares the index file with the tree in `HEAD`, and since I updated `HEAD` they are the same — Max Koretskyi, Aug 23 '17 at 16:15
thanks, I've accepted your answer. It seems that your main point is that putting a tree hash into HEAD will lead to undefined behavior, correct? The two cases you outlined are not really relevant to my case, correct? However, I'm very curious to understand CASE 1. I have some questions regarding the operations; do you know if there's a thread on stackoverflow that shows these operations? If not, I will probably create a new question. Thanks a lot for your time! — Max Koretskyi, Aug 23 '17 at 18:27
Yes, Git is very insistent on having `HEAD` name a valid commit (or nothing at all), so much so that some operations that update `refs/heads/*` names refuse to do so if you attempt to point them to a tag (I tried this as an experiment once). For the read-tree "two tree merge" case, follow the links (the answer text has embedded links). — torek, Aug 23 '17 at 18:47