Undo git rm --cached -r on a clean repository

Question

I had a folder with a Visual Studio 2019 solution in it, containing some 40-ish projects.
I wanted to add the contents to git version control.
I ran: git init to create a local repository.
I created a .gitignore file to exclude any bin, obj, etc. folders.
Since my .gitignore rules weren't affecting the status result, after navigating through various SO questions, I ended up running git rm --cached -r . which deleted absolutely everything except for the very folders I wanted to exclude through .gitignore.

I've checked various other questions, but they all involve existing git history, which unfortunately I don't have.

I haven't touched the repo any more, for fear of making it worse. Where did the files go? Did they get unlinked but could still be recovered through some file recovery software? Any magical command which could help me here?

I did. _`fatal: ambiguous argument HEAD: unknown revision or path not in the working tree`_ — Nelladel, Aug 23 '20 at 13:36
`git rm --cached` does not affect the working tree (the visible files), so you are not describing accurately what you did. But if you said `git rm` without ever committing anything, the deletion is permanent unless you have some other kind of backup. — matt, Aug 23 '20 at 14:25
I must've run `git add .` in between adding files to the `.gitignore` and the `git rm` (see comment on the accepted answer), since the index was deleted, but not the items listed in the `.gitignore` file. — Nelladel, Aug 24 '20 at 06:51

torek · Accepted Answer · 2020-08-23T21:50:57.553

First, note that Visual Studio may add some of its own quirks, about which I know nothing: this answer speaks strictly to Git.

As matt mentioned in a comment, git rm --cached does not touch your working tree. I speculate here that you must have run git rm without --cached.

You're going to want to run git fsck --lost-found. This will get you your file contents back, but not your file names. This is going to be at least somewhat painful, as you will have to manually restore each file to an appropriate name.

Below, I'll tell you what I think happened, why the above works, and what you'll need to do in some more detail.

Long

To understand what happened (given my assumption above), what you can do, and why this is the limit of what you can do using Git directly, it's important to understand how Git works here. A Git repository—that's the stuff inside the hidden .git folder—is all about commits. A repository does have branches, or more precisely, branch names, but it's really all about the commits.

Each commit stores files—that's its main purpose, to store files—and also some metadata, which gives information such as who made the commit, when, and so on. The commits themselves, and the files that are stored inside them, are strictly read-only. They are in a special Git-only format, with compression and de-duplication applied.

The de-duplication deals with the fact that each commit usually duplicates most of the files from some previous commit: by de-duplicating the files, they don't actually take any space, even though each commit has a full copy of each file. This de-duplication is quite safe because no committed file can ever be changed.

But since these committed files are read-only (and compressed and in a format that only Git itself can read), you literally can't work on the committed files. This means that the files you do work on aren't the files that are in Git. This is where your working tree comes in.

Your working tree—I like to shorten this to work-tree—holds usable copies of each file. You do your work within the work-tree. The top level of your work-tree contains the hidden .git folder, in which Git's actual repository resides.

A new, totally-empty repository has no commits (and no branch names). You would create the first commit by running git commit; this would allow branch names to exist as well. Obviously you have not done this yet either (which is OK, but will cause some pain soon).
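As a quick illustration of this "no commits, no branch names" state, here is a minimal sketch that creates a throwaway repository and asks Git about HEAD. The temporary directory and the variable names are just for the demonstration.

```shell
# Throwaway repository in a temporary directory (the path is arbitrary).
demo=$(mktemp -d)
cd "$demo"
git init -q

# HEAD names a branch that does not exist yet, so it cannot be resolved.
if git rev-parse --verify -q HEAD >/dev/null; then
  has_commit=yes
else
  has_commit=no
fi

# With no commits, there are no branch names to list either.
branch_list=$(git branch --list)

echo "has_commit=$has_commit branches='$branch_list'"
```

This unresolvable HEAD is the same condition behind the "ambiguous argument HEAD" error quoted in the comments on the question.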

Before Git can make a commit, though, Git needs you to copy your work-tree files into somewhere that Git itself can use directly for the next commit you will make. This is Git's index. Git also calls this the staging area, or sometimes—rarely these days—the cache. These three names all describe the same thing.

Git's index holds, in Git's compressed-and-read-only format, each of the files that Git knows about. Each index copy is in the de-duplicated format, but isn't actually stored in a commit yet. This means that these copies can be replaced with a new, Git-ified, ready-to-be-committed copy at any time.
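To see the de-duplication at the index level, the following sketch stages two files with identical contents and compares the blob hash IDs recorded for them. The file names are made up, and the repository is a throwaway one.

```shell
# Throwaway repository; one.txt and two.txt are hypothetical file names.
demo=$(mktemp -d)
cd "$demo"
git init -q

echo 'same content' > one.txt
cp one.txt two.txt
git add .

# ":path" asks for the blob hash ID recorded in the index for that path.
hash_one=$(git rev-parse :one.txt)
hash_two=$(git rev-parse :two.txt)
echo "one.txt -> $hash_one"
echo "two.txt -> $hash_two"

# Both names refer to a single stored object, of type "blob".
blob_type=$(git cat-file -t "$hash_one")
git cat-file -p "$hash_one"
```

Two names, one blob: the contents are stored once, and only the index ties the names to them.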

The reason we call Git's index its staging area is that when you do make a new commit, the files that go into the new commit are precisely those files that are in Git's index. Hence the copies that are in the index are staged for commit. Once you do make a new commit, those index copies are now permanently¹ stored in that commit.

There's one other thing to realize about these index copies, though: the file names in the index are only in the index itself. They have embedded slashes (converted to forward slashes even on Windows) in them, e.g., a file's name might be d1/d2/file.ext. The index cannot store folders at all, so if the index contains d1/d2/file.ext, Git itself will create folder d1 if needed, then create d1/d2 if needed, so that Git can create file.ext within d1/d2/. The result is a file named file.ext in a folder named d2 in a folder named d1.

To add a new file to Git's index, or to replace the ready-to-commit contents of a file whose name is already in Git's index, we use the git add command. You must have run something like git add . or git add * early on. At this time, Git read through your work-tree, found every file in it, copied the name to Git's index, and copied-and-Git-ified that file's contents as an internal blob object. That set up Git's index, so that your next—or rather, first—commit was ready to be made.
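If you want to look at what git add put into the index, git ls-files --stage lists the entries. This is only a sketch in a throwaway repository, with made-up file names:

```shell
# Throwaway repository; the file and folder names are made up.
demo=$(mktemp -d)
cd "$demo"
git init -q

mkdir -p d1/d2
echo 'hello' > d1/d2/file.ext
echo 'top'   > README.txt

git add .

# Each index entry shows: mode, blob hash ID, stage number, full path.
git ls-files --stage

# There is no entry for d1 or d1/d2 themselves, only for the files,
# whose names carry the folder parts with forward slashes.
index_paths=$(git ls-files)
```

Note that the path list contains d1/d2/file.ext but nothing for the folders on their own.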

¹Commits themselves are mostly permanent. If you do manage to get rid of one, the files it has saved could be lost—but if they're de-duplicated across other commits that you don't get rid of, the saved files will be retained. This all works automatically and you don't normally need to know anything about it: you can just imagine each commit as holding a full snapshot of every file.

Recap

Let me repeat the crucial stuff above, because it's very important in a moment:

The index stores the file's name, complete with prefix directory / folder name parts.
It's not until you run git commit that Git saves the names, even though Git carefully arranges to have the de-duplicated file contents ready to go, in the next commit.²

The index itself is a temporary construct, not saved forever. It lasts only until you do something that updates or replaces it. It is not copied by git clone either: only the commits (and their permanent snapshots) are copied this way.

²This is a time/space tradeoff: Git could pre-build its internal tree objects too, and have a very different index structure. If Git did this, you could probably store empty folders. But it doesn't: Git builds the tree at commit time, as if by git write-tree. The original git commit command was a script that actually ran git write-tree, saved the resulting hash ID, and used git commit-tree to make the commit that stored the tree that stored the files.

Generally, the time needed to build new tree objects is much shorter than the time needed to compress and blob-ify file contents. So Git builds the tree objects at git commit time, but builds and saves the blobs in advance, at git add time. In some rare cases (very deep trees) this can make git commit kind of slow, though it's nothing like what we used to experience in the bad old days of version control systems, before Git existed.
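For the curious, the tree-building step can be run by hand. The sketch below (throwaway repository, made-up file name) calls git write-tree directly, which is the point at which the names from the index get saved into tree objects:

```shell
# Throwaway repository; d1/d2/file.ext is a hypothetical file.
demo=$(mktemp -d)
cd "$demo"
git init -q

mkdir -p d1/d2
echo 'hello' > d1/d2/file.ext
git add .

# Build tree objects from the current index and print the top-level hash ID.
tree=$(git write-tree)
tree_type=$(git cat-file -t "$tree")

# The top-level tree holds an entry for d1, which is itself a tree.
git ls-tree "$tree"

# Walking the trees recursively recovers the full path from the index.
tree_paths=$(git ls-tree -r --name-only "$tree")
echo "$tree_paths"
```

No commit is made here; git commit would normally do this step for you and then wrap the resulting tree in a commit.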

`git status`

At this point you ran some git status command(s). What git status does is fairly simple, but does require knowing about commits, Git's index, and your work-tree. Fortunately, you now know about these three:

You have no commits yet. For this special situation, Git uses its internal empty tree, which contains no files, as the point of reference in the next few steps.
You have set up Git's index to contain all of your files from your work-tree, so that Git's index matches your work-tree. The actual contents of the index are the files' names, internal Git blob hash IDs for the files' contents, and some cache data that you don't need to know about.
You have, in your work-tree, all of your files, in their normal everyday form.

The git status command starts by printing some information: your current branch name, for instance. We'll just skip over this part.

Next, git status compares the contents of the current commit to the contents of Git's index. Since you're in this new-repository-no-commits-yet state, Git uses the empty tree here. That makes every file in Git's index a "new file", to be committed.

Last, git status compares the contents of the index to the contents of your work-tree. These match exactly (after accounting for the Git-only format of the index copy, that is). When a file in the index matches the copy in your work-tree, Git says nothing—so since all files match, Git says nothing here.
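The two comparisons can be seen in the machine-readable form of git status, where the first column is the commit-vs-index result and the second is the index-vs-work-tree result. A small sketch, again in a throwaway repository with made-up file names:

```shell
# Throwaway repository; a.txt and b.txt are hypothetical files.
demo=$(mktemp -d)
cd "$demo"
git init -q

echo 'a' > a.txt
echo 'b' > b.txt
git add .

# Index vs. (empty) commit: everything is added ("A").
# Index vs. work-tree: identical, so the second column is blank.
status_clean=$(git status --porcelain)
echo "$status_clean"

# Change the work-tree copy without re-adding it.
echo 'changed' >> b.txt

# Now b.txt also differs between index and work-tree ("M" in column two).
status_dirty=$(git status --porcelain)
echo "$status_dirty"
```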

What you see, then, is that every file is staged for commit, including the binary files you did not mean to commit. At this point, listing such a file in .gitignore—or listing its containing folder—does not do any good: the file is already in Git's index so it is going to be in the next commit.
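Here is a sketch of that situation, plus one narrow way out of it that I'd consider safer than removing everything: take only the unwanted folder out of the index, with --cached, and leave the rest alone. The bin folder and file names are made up for the example.

```shell
# Throwaway repository; bin/app.dll and Program.cs are hypothetical files.
demo=$(mktemp -d)
cd "$demo"
git init -q

mkdir bin
echo 'binary' > bin/app.dll
echo 'code'   > Program.cs
git add .

# Adding the ignore rule afterwards does not un-stage anything.
echo 'bin/' > .gitignore
staged_before=$(git ls-files bin)
echo "still staged: $staged_before"

# Remove just that folder's entries from the index; --cached keeps
# the work-tree files in place.
git rm -r -q --cached bin
staged_after=$(git ls-files bin)

# bin/ is now ignored; Program.cs is still staged.
git status --short
```

After this, the work-tree copy of bin/app.dll is still there, but it is no longer going into the next commit.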

`git rm`

What you did next is a bit of a disaster: you ran git rm -r ., without --cached.

The git rm command is meant for removing the copies of files that are in Git's index and, if you leave out --cached, the corresponding work-tree files. If you use --cached, git rm leaves the work-tree files alone.

The only files that git rm can remove are those mentioned in Git's index. It will either remove both the index copy and the work-tree copy, or the index copy only, but if some file is not in the index, git rm won't remove it from your work-tree.
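The difference between the two forms is easy to check in a throwaway repository. In this sketch the file names are made up, and the non-cached removal uses -f, because as far as I can tell Git refuses a plain git rm of a file that is staged but has never been committed:

```shell
# Throwaway repository; all three file names are hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q

echo 'keep me'     > kept.txt
echo 'lose me'     > lost.txt
echo 'never added' > untracked.txt
git add kept.txt lost.txt

# Index entry removed; the work-tree file stays.
git rm -q --cached kept.txt

# Index entry and work-tree file both removed (-f forces past the
# safety check for staged-but-uncommitted content).
git rm -q -f lost.txt

# untracked.txt was never in the index, so git rm never touches it.
ls
```

What remains in the folder is kept.txt and untracked.txt, while the index is empty.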

As Git doesn't actually store folders, git rm sometimes doesn't remove them either. Of course, if there is a file left behind, Git literally can't remove it, because your computer requires that the folder exist to hold the file's name-within-folder. Git is usually fairly good about cleaning up folder names that Git made, but I've seen it forget to remove some now and then. Sometimes you might want to just go in and manually delete any empty ones (or use git clean -d, but be careful with git clean!).

`git fsck --lost-found`

As mentioned earlier, Git stores file contents in a special, read-only, Git-only format that Git calls a blob object. These blob objects are referred-to by commits—technically, by tree objects—and/or by Git's index. A reference to a blob makes the blob reachable, which is a technical term I won't actually define here.

The git fsck command, which isn't something you need to run in any normal situation, reads and analyzes the contents of Git's internal databases, which includes scanning through every internal Git object. An important side effect³ of this scan is that Git will find any "dangling" commits and blob objects.⁴ Adding --lost-found to the git fsck command tells it to, in effect, resurrect such commits and blobs by writing them out under .git/lost-found.

In your case, there are no commits at all, but all the files you git add-ed became blob objects. So git fsck should find, for every file that was in Git's index before the disastrous git rm -r step, a dangling blob. The fsck command will expand out the blob contents, writing it to a file named .git/lost-found/other/hash, where hash is a big ugly internal Git hash ID.
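The whole sequence can be rehearsed safely in a throwaway repository before trying it on the real one. This is a sketch under my assumptions about what happened: files were added, never committed, and then removed with a forced git rm. The file names are invented.

```shell
# Throwaway repository; the source files are hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q

mkdir src
echo 'class Program {}' > src/Program.cs
echo 'class Helper {}'  > src/Helper.cs
git add .

# Simulate the accident: index entries and work-tree files both go away.
git rm -r -q -f .

# Scan every object; dangling blobs get written to .git/lost-found/other/.
git fsck --lost-found

# One recovered file per blob, named by hash ID rather than by path.
ls .git/lost-found/other
recovered_count=$(ls .git/lost-found/other | wc -l)
```

The contents come back intact, but nothing in the recovered file names says which one was Program.cs and which was Helper.cs.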

To restore your files, you will now need to look at every file in that folder. Use its contents to determine the correct file name, and rename the file (or copy the contents, but renaming the file helps reduce the number of files left to inspect) into the right place.
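To speed up the inspection a little, something like the following helper can print the size and first few lines of each recovered file. The function name preview_recovered is my own invention, not a Git command, and it only helps with text files.

```shell
# preview_recovered DIR: for each regular file in DIR, print its name,
# its size in bytes, and its first three lines.
# (Hypothetical helper; not part of Git.)
preview_recovered() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    printf '== %s (%s bytes)\n' "${f##*/}" "$(wc -c < "$f" | tr -d ' ')"
    head -n 3 "$f"
  done
}

# Typical use, from the top level of the work-tree:
preview_recovered .git/lost-found/other
```

For C# sources, the namespace and class declarations near the top are often enough to work out where a file belongs.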

The files' names were only in Git's index, which has been overwritten. So only the contents can be restored mechanically, and git fsck --lost-found does that. That's why you have to recover all the file names manually. I have done this task myself, long ago (and probably with a smaller set of files), and it is no fun.

³In git fsck, this is a side effect. In git gc, which Git runs automatically for you, it's a desired effect: this is how Git trims off dead objects, including files git add-ed but replaced by newer git adds before a commit, or added and then git rm-ed and never committed, for instance.

⁴Git distinguishes between unreachable and dangling commits and blobs here, to make git fsck more usable. Since commits form chains, we can have a chain of commits that is unreachable as a whole, with all but one of these commits reachable from other commits in this same chain. In Git's terminology, the one commit with in-degree zero in the graph is the only dangling commit, while every commit in the chain is unreachable. Any blob objects that are referred to only through commits in this chain are, by some definitions at least, reachable, but aren't reachable from outside the chain, so those are unreachable too, without being dangling. You don't really need to know any of this either, but if you're familiar with graph theory, it should all make sense.

One of the most amazing answers I've read. Thank you. I think I must've done a `git add .` before running the `git rm`, since I distinctly remember typing `--cached`. Nonetheless, following your guide with `git fsck` I was able to recover all of my files (albeit by scanning through the `dangling blob`s and renaming the files). Life-saver, thank you! — Nelladel, Aug 24 '20 at 06:49