Why does git behave this way? Inconsistency between OS and VM accessing the same repository

Question

Allow me to explain the setup. I have a PC (Windows, not sure whether that is a variable here) which has a git repository. This repository works and behaves as expected; for the sake of this question, assume a single file has been updated and not yet committed to the current branch. I also have a VM (Linux) on the box. The VM can access the host's filesystem through a share mounted as a drive. Git is installed on the VM, and the repository is authenticated.

On the VM I can see the current branch from a `git branch` command. However, if I request the list of uncommitted files via `git status` or `git add --dry-run .`, I get a list of every file, not the single uncommitted file that I expect to see. Another hint: if I start a long-running process, say `git add --dry-run .`, and run the same command on the host OS while it is still chugging along, I get an error about the git lock file (which tells me that both are using the same filesystem/database). I assumed this could be caused by the host NTFS filesystem being case-insensitive and the guest ext4 filesystem being case-sensitive, but the case of the file names matches and git reports them the same on both.

So the question is why does the guest OS show committed files as uncommitted?

May be related to How does git status work internally?

Also I should mention that git has a few different line ending options, I aligned these to match each other. — user1529413, Aug 24 '18 at 13:02
Normally when you access a singular git repo you do so from the same OS, as you clone it locally. Maybe there's a clash between the different file paths? `\ ` vs `/` — evolutionxbox, Aug 24 '18 at 13:17
Check the `core.autocrlf` setting on both machines: `git config core.autocrlf`, see what it says. — Lasse V. Karlsen, Aug 24 '18 at 13:18
You can also have differences when the execution bit of each file is considered different on different OSes. `git config --global core.filemode false` will permanently dismiss these differences. — Romain Valeri, Aug 24 '18 at 13:23
I did not consider the slashes; I am using bash on both systems... hmm. The `core.autocrlf` setting matches on both machines; I did play with these settings quite a bit. — user1529413, Aug 24 '18 at 14:52
@EdwardThomson I don't have a .gitattributes file (nor one in user settings); can you expand on your idea a little? — user1529413, Aug 24 '18 at 14:53
@RomainVALERI `git config --global core.filemode false` did not resolve it, great idea though, it may indeed be part of the problem — user1529413, Aug 24 '18 at 14:56
I added a .gitattributes file with `* text=auto`, it seemed to change the files to include the error message `The file will have its original line endings in your working directory.` — user1529413, Aug 24 '18 at 15:01

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

Git does not really have a notion of "uncommitted file". What it does have is the index and the work-tree.

The main thing Git stores are commits:

Commits are permanent (mostly¹), completely-read-only entities stored in a database of sorts (a simple key-value store, really) that allow Git to access the complete snapshot of the source you, or the committer, made when you, or the committer, made that commit. Along with that snapshot, you—I'll leave out the "or the committer" from here on but of course it is implied—get a chance to add your own metadata, specifically, the log message about why you made that commit.

The "true name" of any commit is its hash ID. Git uses the hash ID as the key in the key-value store, to retrieve the commit. Each commit also contains the hash ID of its predecessor or parent commit (or, for merge commits, two or more parent hash IDs—this is what makes them "merge commits").
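The key-value lookup described above can be tried in a throwaway repository. This is only a sketch: the temporary directory, the file name, and the placeholder identity are assumptions made for the example, not part of the original setup.

```shell
# Throwaway repository so the example does not touch real data.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "hello" > file.txt
git add file.txt
git commit -q -m "first commit"
echo "world" >> file.txt
git commit -q -a -m "second commit"

head_id=$(git rev-parse HEAD)       # hash ID (the key) of the current commit
parent_id=$(git rev-parse HEAD~1)   # hash ID of its parent

# Retrieve the commit by its key; the output includes a "parent <hash>" line.
git cat-file -p "$head_id"
```

The `parent` line printed by `git cat-file -p` should match `$parent_id`, which is the link that lets Git walk back through history.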

One commit is always the current commit. That's the one commit you selected (via git checkout) to work with. Because commits are read-only, you cannot change this commit. What you can do is, at some point, make a new commit. Normally, this new commit will use the current commit as the new commit's parent, and then become the current commit, and this is why you can always get back every file you ever committed: commits are permanent (mostly) and read-only (completely) and remember their parents.

The files stored with a commit—the snapshot you made—are saved in a compressed, Git-only format that is not useful to anything other than Git. So these files must be extracted from each commit before you can use them. Hence Git also has:

A work-tree. Here, Git can extract the files from a commit into the format in which the computer uses them. These files should not be shared across computers, not because it cannot work, but because it can and this just makes for big headaches, as you are discovering.

Since files in the work-tree are stored in the native format, and are used by other programs, Git offers the ability to modify the files (specifically things like line endings and permission bits) as they come out of a commit on the way into the work-tree, and as they go from the work-tree into a commit. But there's one more key item and this is where the biggest headaches come from.

The index. This item sits between the current commit and the work-tree.

The index stores all the files in their special Git-only format. It starts out containing the files as they were when they were committed. The key difference between the commit's copy of the files and the index's copy is that you can change the ones in the index. You change them by replacing them wholesale, using git add to copy the work-tree file back into the index.
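A small experiment can make the index visible. The sketch below uses a temporary repository with a hypothetical `file.txt`; `git ls-files --stage` prints the mode, blob hash, and stage number that the index holds for each file.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "v1" > file.txt
git add file.txt
git commit -q -m "initial"

before=$(git ls-files --stage file.txt)     # index entry matches the commit

echo "v2" > file.txt                        # edit the work-tree copy only
unchanged=$(git ls-files --stage file.txt)  # index still holds the old blob

git add file.txt                            # copy the work-tree file into the index
after=$(git ls-files --stage file.txt)      # index now holds a new blob hash

printf '%s\n%s\n%s\n' "$before" "$unchanged" "$after"
```

Editing the file alone leaves the index entry untouched; only `git add` replaces the blob hash.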

When you make a new commit, Git simply uses whatever is in the index at that time. All the files are already there, all pre-packaged in the Git-only format. This makes committing very fast.

What this also means is that the transformation from Git-only format to "useable by this computer" format, and vice versa, happens on the copy from index to work-tree (which changes files from Git-only to useable) and on the `git add` copy from work-tree to index (which changes useable to Git-only).
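Line-ending conversion is one such transformation, and `git ls-files --eol` shows both sides of it. The sketch below assumes a fresh temporary repository, a hypothetical `notes.txt` written with CRLF endings, and a `.gitattributes` containing `* text=auto`; global settings such as `core.safecrlf` could change the result.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "* text=auto" > .gitattributes
printf 'one\r\ntwo\r\n' > notes.txt    # work-tree copy uses CRLF

# Git may print a warning about CRLF being replaced by LF here.
git add .gitattributes notes.txt

# Report the line endings of the index copy (i/...) and work-tree copy (w/...).
eol_info=$(git ls-files --eol notes.txt)
echo "$eol_info"
```

With these assumptions the index copy should be reported as `i/lf` while the work-tree copy stays `w/crlf`, so the two differ byte for byte even though Git considers the file unmodified.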

This is almost always the slowest-by-far part of dealing with commits and files, so the index keeps track of (indexes!) the work-tree, using OS-specific information. That OS-specific information, found via the OS about the work-tree, goes into the index.

If you share the work-tree and index and .git files across machines, what happens is that the index itself becomes useless, because the OS-specific work-tree data stored in the index is for the VM or the host, but never for both at the same time.
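The cached OS-specific data can be inspected with `git ls-files --debug`, and a stale cache can be simulated on a single machine. The sketch below is an illustration under assumptions (temporary repository, hypothetical `file.txt`): it changes only the file's timestamp, which is a much milder mismatch than the one a host and a VM produce, where fields such as device, inode, uid, and gid may all differ.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "content" > file.txt
git add file.txt
git commit -q -m "initial"

# Show the stat data cached in the index (ctime, mtime, dev, ino, uid, gid, size).
git ls-files --debug file.txt

# Change only the modification time; the contents stay identical.
touch -t 200101010000 file.txt

# diff-files trusts the cached stat data, so it lists the file as possibly modified.
stale=$(git diff-files --name-only)

# Re-check the contents and rewrite the cached stat data.
git update-index -q --refresh
fresh=$(git diff-files --name-only)

echo "before refresh: '$stale'  after refresh: '$fresh'"
```

After the refresh the file is no longer listed, because Git has re-read it and found the contents unchanged. When two systems take turns doing this against one shared index, each one invalidates the other's cached data.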

When the index is correct and describes the work-tree correctly, git status is fast and accurate. When it's not, the two diffs it must run (see my answer to the question you linked) cannot be done nearly as efficiently. If you use any kind of file transformation, it must either be re-run or be assumed to have changed the files.

The TL;DR of all of this is: never share a Git repository this way; use the fetch and push mechanisms to share it instead. This is not because it does not work, but rather because it can work, but becomes a horrible experience. The file name case-folding issues you identified are the tip of another whole nightmare iceberg (not directly solved by not-sharing the repository, but at least possible to solve that way).
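One way to follow this advice is to give the VM its own clone and move commits between the two repositories. The sketch below simulates both sides with two directories under a temporary folder; the `host` and `vm` directory names are hypothetical stand-ins for the Windows repository (reached through the share) and a clone on the VM's own filesystem.

```shell
base=$(mktemp -d)

# "Host" repository (stands in for the repository on the Windows side).
git init -q "$base/host"
cd "$base/host"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
echo "hello" > file.txt
git add file.txt
git commit -q -m "initial"

# On the VM: clone to the VM's own disk instead of working inside the share.
git clone -q "$base/host" "$base/vm"
cd "$base/vm"
git config user.name "Example User"
git config user.email "user@example.com"
echo "edited in the VM" >> file.txt
git commit -q -a -m "edit made in the VM"

# Back on the host: fetch the VM's commit and fast-forward to it.
cd "$base/host"
git fetch -q "$base/vm" HEAD
git merge -q --ff-only FETCH_HEAD
```

Each side keeps its own index and work-tree, so neither invalidates the other's cached data; only commits travel between them.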

¹You can remove a commit, as long as you also remove all of its children and their children and so on. That is, removing a commit requires a sort of commit-line genocide. It's often a bad idea to do this, and if you are going to do it, you usually have to copy the entire chain of children—but sometimes it's a good idea, and in fact this is what git rebase does internally.

Note that git commit --amend does not change a commit. Instead, it just shoves aside (and thus eventually kills off and removes) the existing end-of-chain commit by creating a new replacement end-of-chain commit, using the current commit's parent as the new commit's parent.
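The effect of `--amend` can be checked by comparing hash IDs before and after. This is a sketch in a temporary repository with placeholder names and messages.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "one" > file.txt
git add file.txt
git commit -q -m "first"
echo "two" >> file.txt
git commit -q -a -m "second"

old_id=$(git rev-parse HEAD)
git commit -q --amend -m "second, reworded"
new_id=$(git rev-parse HEAD)

# The replacement has a different hash ID but the same parent as the original.
echo "old: $old_id"
echo "new: $new_id"
git rev-parse "$old_id^" "$new_id^"

# The original commit object is still present in the object store for now.
git cat-file -t "$old_id"
```

The old commit is no longer reachable from the branch, but it remains in the repository until Git eventually prunes it.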

I appreciate the time you took to write this, but I do not think it addresses the question at all. Perhaps I was unclear. If I had no changed files, I would still see the host reporting 0 uncommitted files and the guest reporting multiple files. My goal is to be able to use the guest or to know why it's showing different results. — user1529413, Aug 24 '18 at 17:35
I left one thing out: the index stores the host name. If the host name changes (which it does when switching from VM to host or vice versa), the entire index becomes invalid. So Git has to check every file. If there are any differences in the way they're stored in "host format", they will all be modified! — torek, Aug 24 '18 at 18:20
Yes, but it doesn't show the hostname part (nor the internal checksums): `git ls-files --stage` or, even more verbose, `git ls-files --debug` dumps out as much as is user visible, but there are extra records that this doesn't show. — torek, Aug 24 '18 at 18:31
Here's where I'm stuck with this answer (much appreciated, btw): if the host and guest have no changes at all, the lists of files are still different (`git add --dry-run .` produces 0 items and all items, respectively). — user1529413, Aug 29 '18 at 12:39
I'd have to dive in to investigate the source code in Git with an example with a VM that does this, but in general, once the index is corrupt or wrong (which it is from *one* of the two systems' points of view), a lot of operations get weird. `git add` generally winds up throwing out the bad index and replacing it with the good one, which means it will re-add all files. It still boils down to the issue being that you should never attempt to share an index across separate machines (this includes using SMB or other shared mounts, or Dropbox, etc). — torek, Aug 29 '18 at 15:02
