Unable to rm directory from git --cached

Question

I accidentally added a directory to my repo. No big deal, I'll just run

git rm --cached <dir-name>

but I get this error

error: the following file has staged content different from both the
file and the HEAD:
    <dir-name>
(use -f to force removal)

What may complicate this a bit is that the directory added to the cache is itself a git repo, and the only cached changes in the root repo have to do with the git Subproject.

git diff --cached

output:

--- /dev/null
+++ b/<dir-name>
@@ -0,0 +1 @@
+Subproject commit <commit-id>

The commit id referenced is the current nested repo HEAD, but I may have added to it since the accidental add.

My first instinct was just to use -f to force, since this is not in the actual repo but only staged. But this answer made me think twice, as I do not want to permanently mess anything up.

Luckily these are entirely local so I'm not dealing with any remotes, and I'm sure there is an easy fix to this, but I just want to do it correctly. Should I just run git rm --cached -f <dir-name>? or do I need to take another approach?

score 3 · Accepted Answer · answered Sep 01 '20 at 20:38

TL;DR

Your first instinct was almost certainly correct: you probably want git rm --cached -f path, where path names the path to the sub-repository. This will remove the thing that Git calls a gitlink from the index / staging-area / cache.

Long

First, remember that Git does not store directories at all. So this directory is not in Git in the first place. The reason why has to do with what Git calls, variously, the index, the staging area, or—relatively rarely now but still visible in git rm --cached—the cache.

Second, know that Git never stores another Git repository in a repository. Or, to put it another way, repositories never actually nest. The actual implementation here is to forbid any path name components that consist of .git (including case-insensitive variants such as .GIT or .Git or whatever).¹

What you have here is what Git calls a submodule (or perhaps half a submodule: the half that Git calls, internally, a gitlink).

¹In very old versions of Git, the authors forgot to account for case-insensitive file systems on Windows and macOS, and allowed creating repositories with files named, e.g., foo/.GIT/HEAD and the like. This made the "outer" Git treat the foo/.GIT directory as another Git repository, which made it far too easy to set up Trojan horse repositories as traps for those using these systems.

Commits

Git is ultimately built out of two key-value databases, one of which is copied by cloning. (The other, which holds branch and tag and other such names, is partly copied but modified during cloning.) The main database consists of commits and other internal Git objects. Each of these objects is read-only, because the way Git finds these objects is by its key, and its key is itself a cryptographic checksum of the object. If you take an object out of the database, manipulate some of its bits, and then try putting it back, what you get is not a modified object, but rather a new object, with a new and different key.²
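To see the "key is a checksum of the content" idea in action, here is a small sketch using git hash-object, which computes the key Git would use for a blob with the given content. The sample strings are made up for illustration:

```shell
# Hash the same content twice, then hash a slightly modified copy.
key_a=$(printf 'hello\n' | git hash-object --stdin)
key_b=$(printf 'hello\n' | git hash-object --stdin)
key_c=$(printf 'hellp\n' | git hash-object --stdin)

echo "original: $key_a"
echo "repeat:   $key_b"   # same content, so the same key
echo "modified: $key_c"   # changed content, so a different key
```

Nothing is written to any repository here (that would need the -w option); the commands only compute keys.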

The most interesting object for our discussion here is the commit. A commit contains a snapshot of all the files that Git knows about.

²This makes the assumption no key will ever repeat unless the value itself is a duplicate. (This duplicate value = same key trick is how Git de-duplicates file content.) Git currently uses SHA-1, which is good enough in practical terms, but is susceptible to deliberate attacks. The consequences of such an attack are mostly just nuisances, fortunately. For more about this, see How does the newly found SHA-1 collision affect Git?

The index

Git builds new commits by first storing, in something it calls the index,³ a series of records giving path names and hash IDs for Git objects—mostly blob objects that will store those files' contents. There is no record-type that will hold a directory, and this is why Git cannot store directories.

The git commit command simply packages up the index's records⁴ and wraps the package with a commit object, so as to make the new commit. So the index's function is to be the staging area: it contains the proposed next commit. Since the index is not itself a Git object, it can be modified in place as needed.

For concreteness, the actual records—ignoring headers and extensions and just concentrating on the index's normal everyday file entries—consist of:

a mode based on Unix-style inode mode fields;
a path name;
a hash ID giving an internal Git object ID; and
other cache data I'll ignore here.

The mode is 100644 or 100755 for ordinary files (you will see these often in git diff output), with other mode values reserved for symbolic links and gitlinks. The path name contains any slashes needed: files here can have long names such as path/to/file.txt. That's not a directory named path that contains a sub-directory named to that contains a file named file.txt: it's literally a file whose name is path/to/file.txt.
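You can inspect these records yourself with git ls-files --stage. The following is a minimal sketch in a throwaway repository; the file names run.sh and path/to/file.txt are just examples:

```shell
repo=$(mktemp -d)
git init -q "$repo"

mkdir -p "$repo/path/to"
echo 'some text' > "$repo/path/to/file.txt"
printf '#!/bin/sh\necho hi\n' > "$repo/run.sh"

git -C "$repo" add path/to/file.txt run.sh
# Mark run.sh as executable in the index, regardless of file system support.
git -C "$repo" update-index --chmod=+x run.sh

# Each output line is: mode, object hash ID, stage number, then the path name.
git -C "$repo" ls-files --stage
```

The output should list path/to/file.txt with mode 100644 and run.sh with mode 100755, each with its full slash-containing name and no separate entry for any directory.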

Note that checking out some existing commit first fills in Git's index with these records as stored in that commit, then populates your working tree with actual files if / as needed.

³This is currently a single file usually named .git/index, but it can itself refer to additional files. This is a bit problematic because these additional files can't be properly protected during Git operations. Very large index files (e.g., millions of records) result in performance problems, hence the notion of a "split index", which this answer doesn't cover at all.

⁴Git turns the names into one or more internal tree objects that generally refer to more tree objects, with each slash-separated name component grouped into some sub-tree. If the index could store directory names, these tree objects would allow Git to store an empty directory—but it can't, so Git can't.
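A rough way to watch both effects, assuming a scratch repository (the directory names empty and path/to are illustrative): adding an empty directory records nothing, while git write-tree turns the slash-separated names in the index into nested tree objects.

```shell
repo=$(mktemp -d)
git init -q "$repo"

mkdir -p "$repo/empty" "$repo/path/to"
git -C "$repo" add empty          # no files inside, so nothing gets recorded
git -C "$repo" ls-files           # prints nothing

echo 'some text' > "$repo/path/to/file.txt"
git -C "$repo" add path/to/file.txt

# Package the index into tree objects, as git commit would do internally.
tree=$(git -C "$repo" write-tree)
git -C "$repo" ls-tree "$tree"      # top level: a single sub-tree named "path"
git -C "$repo" ls-tree -r "$tree"   # recursive view: the blob path/to/file.txt
```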

A submodule is a reference to another Git repository

This finally gets us to submodules. We know that:

a repository is a collection of commits, and
commits are identified by hash IDs.

What if we could have Git clone some other repository for us, automatically, while we are working, and then git checkout the correct commit in that other repository? This is what submodules are all about.

In order to clone a Git repository, Git needs:

a URL, and
a place to deposit the cloned repository: a path relative to this repository.

To get the "outer" or superproject Git to git clone some inner Git, we need to store this information. This stuff goes into a plain-text file, formatted like a Git configuration file, called .gitmodules.
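To give a feel for what that file holds, here is a sketch that writes one by hand with git config -f. Normally git submodule add creates it for you; the submodule name inner and the example.com URL below are hypothetical placeholders:

```shell
repo=$(mktemp -d)
git init -q "$repo"

# Hypothetical submodule name, path, and URL, for illustration only.
git -C "$repo" config -f .gitmodules submodule.inner.path inner
git -C "$repo" config -f .gitmodules submodule.inner.url https://example.com/inner.git

# The result is a small config-style text file with one [submodule "inner"] section.
cat "$repo/.gitmodules"
```

Writing the file this way does not create a submodule by itself; it only records the path and URL that a later clone would need.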

Once the clone is made, though, we need to have the superproject Git enter the submodule and run git checkout hash or git switch --detach hash. This requires two things:

a path relative to this repository, and
a commit hash ID.

The superproject Git gets these from Git's index, which as we already saw, stores both a path name and a Git hash ID. When a commit contains a gitlink—an entity with mode 160000—the checkout operation just reads this gitlink into the index. So now Git has, in the index, a path/to/gitlink or whatever name, along with a stored commit hash ID.

This means the index stores gitlinks

Whenever you are:

in your superproject working tree (and not down within the submodule working tree), and
you run git add on a path that is a path to the submodule,

your superproject Git will add to its index, or update in its index, the appropriate gitlink entry. Note that Git does not check, at this time, whether there is an appropriate .gitmodules entry. It just updates or adds the gitlink in the superproject Git's index.

The superproject Git finds the hash ID that goes with this gitlink by cd-ing into the submodule and running git rev-parse HEAD.⁵ So that updates the gitlink entry in the index, based on whatever commit is actually checked out in the submodule.
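Here is a small sketch of that behavior, using throwaway repositories whose names (outer, inner) and demo identity are made up. It stages a nested repository and compares the staged hash ID with the nested repository's HEAD:

```shell
# A throwaway identity so the commit works on an unconfigured machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

outer=$(mktemp -d)
git init -q "$outer"
git init -q "$outer/inner"
git -C "$outer/inner" commit -q --allow-empty -m 'inner commit'

# Stage the nested repository; Git's embedded-repository hint is silenced.
git -C "$outer" add inner 2>/dev/null

staged=$(git -C "$outer" ls-files --stage inner)
inner_head=$(git -C "$outer/inner" rev-parse HEAD)
echo "$staged"                   # mode 160000 marks a gitlink
echo "inner HEAD: $inner_head"   # same hash ID as in the staged entry
```

Note that no .gitmodules file exists at any point here, which is exactly the "gitlink only" state described above.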

If the .gitmodules file is missing or incomplete, this particular submodule is, well, kind of half-assed: any other clone you make of this repository won't have any idea what URL to use to run git clone to obtain the submodule. Since you mentioned that this is all entirely local, that probably does not matter for your use case.

⁵Current versions of Git literally do this, and it's not the most efficient process. New versions of Git in the pipeline have facilities to avoid starting new sub-commands, yet achieve the same result.

Conclusion

If you don't want a submodule—or a half-assed one that consists only of the saved gitlink, without the necessary stuff to git clone the submodule in the first place—you should remove the gitlink from the index. Using:

git rm --cached -f path/to/gitlink

will do that. Make sure you use the --cached option! (Fortunately, if you forget, it should just error out, I believe.)
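If you want to rehearse this somewhere harmless first, the following sketch tries to recreate the situation from the question in temporary directories (outer, inner, and the demo identity are made-up names) and then applies the fix:

```shell
# A throwaway identity so the commits work on an unconfigured machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

outer=$(mktemp -d)
git init -q "$outer"
git -C "$outer" commit -q --allow-empty -m 'outer: initial commit'

git init -q "$outer/inner"
git -C "$outer/inner" commit -q --allow-empty -m 'inner: first commit'
git -C "$outer" add inner 2>/dev/null      # the accidental add
git -C "$outer/inner" commit -q --allow-empty -m 'inner: later commit'

# Without -f, Git should refuse: the staged gitlink matches neither
# HEAD nor the nested repository's current commit.
git -C "$outer" rm --cached inner || echo "refused without -f"

# With -f, only the index entry is removed.
git -C "$outer" rm --cached -f inner

git -C "$outer" ls-files --stage inner     # prints nothing: the gitlink is gone
ls -d "$outer/inner/.git"                  # the nested repository is still on disk
```

The nested repository and both of its commits are untouched afterwards; only the superproject's index changed.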

If this was a proper submodule, you may want to do even more: see What is the current way to remove a git submodule? If it was never properly added, though, there's nothing more to do.

Thanks for such a complete answer. Much appreciated. – alrob Sep 02 '20 at 16:15