Cannot prevent git from modifying files after rm --cached

Question

I am attempting to run 'git rm -rf --cached .' followed by 'git add .' to stop tracking files that are now listed in the .gitignore. I use Visual Studio on a Windows computer, and would prefer to leave line endings exactly as they are in this particular situation.

I tried setting core.autocrlf to false using the git config command. I also tried creating a .gitattributes with the line '* -text', rm'ing the .git/index, and running git reset. So far, every time I add the files back, I get a huge list of modified files.

EDIT: The change in the files is not actually line endings; it is a change in file permissions, which I did not request.

torek · Answer 1 · 2018-09-12T00:41:41.237

Edit: the remaining problem is that the file modes are apparently not stored properly on Windows systems (see also What is git's "filemode"?). To save and restore them, one will need a script, plus the original data, saved before running the sequence below:

git ls-files --stage > /tmp/original

To recover the modes, this rather crude pipeline should work:

< /tmp/original \
awk -F$'\t' '/^100755 / { print "git update-index --chmod=+x \"" $2 "\"" }' |
sh

This will attempt to chmod +x files that have been removed by the below sequence, so you can expect some error messages if there are any such files. (It also assumes no files have double quotes in their names.)
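The same save-and-restore idea can also be sketched as a read loop instead of generated commands, which sidesteps the double-quote limitation. This is only a sketch under stated assumptions: it runs in a throwaway repository, the file names (demo.sh, notes.txt) and the saved_modes variable are hypothetical, and it assumes path names contain no tabs, newlines, or characters that git ls-files would quote.

```shell
#!/bin/sh
# Sketch: save executable bits from the index, lose them, then restore them.
# Runs in a throwaway repository; all file names are hypothetical.
set -eu

repo=$(mktemp -d)
saved_modes=$(mktemp)
cd "$repo"
git init -q

printf '#!/bin/sh\necho hi\n' > demo.sh
printf 'plain text\n' > notes.txt
git add demo.sh notes.txt
git update-index --chmod=+x demo.sh      # index now records 100755 for demo.sh

# Save "mode hash stage<TAB>path" for every index entry.
git ls-files --stage > "$saved_modes"

# Simulate the problem: the executable bit disappears from the index.
git update-index --chmod=-x demo.sh

# Restore: re-apply +x to every path whose saved mode was 100755.
tab=$(printf '\t')
while IFS="$tab" read -r meta path; do
    case $meta in
        100755\ *)
            git update-index --chmod=+x -- "$path" 2>/dev/null ||
                echo "skipped (no longer tracked): $path" >&2
            ;;
    esac
done < "$saved_modes"

restored=$(git ls-files --stage)
echo "$restored"
```

Paths that were removed from the index in the meantime only produce a "skipped" message, so the loop can be run over the whole saved list.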

Assuming you do not already have a .gitattributes file, here is a six-step process that should work:

1. Create that .gitattributes file just as you did
2. Run rm .git/index
3. Run git checkout HEAD -- .
4. Run git rm -r --cached .
5. Run git add .
6. Run git rm -f .gitattributes (the -f is needed because the file is staged but not in HEAD; you can leave this until after verifying that it all worked). Run git commit afterward.
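As a rough illustration, the six steps can be tried end to end in a throwaway repository before touching a real one. Everything in the setup (keep.txt, build.log, the demo identity, the commit messages) is hypothetical and only there so the sketch is self-contained. Step 6 is written as git rm -f here, on the assumption that .gitattributes is newly staged and absent from HEAD, in which case a plain git rm refuses to remove it.

```shell
#!/bin/sh
# Sketch: the six steps, run against a throwaway repository.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"             # hypothetical identity
git config user.email "demo@example.invalid"

# Setup: build.log was committed before .gitignore started excluding it.
printf 'source\n' > keep.txt
printf 'noise\n'  > build.log
git add keep.txt build.log
git commit -q -m "initial commit"
printf '*.log\n' > .gitignore
git add .gitignore
git commit -q -m "ignore logs"

printf '* -text\n' > .gitattributes          # step 1: request no conversions
rm .git/index                                # step 2: discard the index
git checkout -q HEAD -- .                    # step 3: refill index and work-tree
git rm -r -q --cached .                      # step 4: untrack everything
git add .                                    # step 5: re-add, honoring .gitignore
git rm -q -f .gitattributes                  # step 6: drop the temporary directive
git commit -q -m "stop tracking ignored files"

tracked=$(git ls-files)
echo "$tracked"
```

After the final commit, build.log is still present in the work-tree but is no longer tracked.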

I do not have (nor use) Windows so cannot test this, but here's the theory behind why it should work, and hence why there are these steps.

Git's actual data storage format is a special, Git-only, compressed (sometimes highly compressed) format. Files stored in this format are mainly useful only to Git itself. This format stores a raw, uninterpreted byte stream: files do not have to be separated into "text" and "data" and so on, they are just raw byte streams (hence treated as "data" / "non-text"). The data, once stored, are read-only and get assigned a hash ID (currently SHA-1 though a future Git may use SHA-256). Git calls a file stored this way a blob, which is a term stolen from the database world.

Your computer's useful-file-storage format is of course different, and may (and does on Windows) make a distinction between "text" and "data". Text may have encodings (such as ISO-8859-1, UTF-8, UTF-16, and so on). These files are generally both readable and writable and anything on your computer can deal with them (to some degree anyway, depending on encoding).

Git has to extract files from commits, turning them from blobs into files that you can work with. These files live in your work-tree. You work with them, and then git add them to give Git a chance to re-blob-ize them.

In between these special Git-only blobs and the work-tree, Git needs a place to store the blobbed data, that—unlike a commit—is writable, but that—like a commit—has the file in the special Git-only format. This "in between" place is Git's index. Various bits of Git documentation sometimes call this the staging area or the cache.

Git uses the index copy of each file (or blob, really) to make new commits. When you run git add, Git reads the work-tree file, encodes it down into the blob form, and saves it—well, its hash ID, really—in the index. When you run git commit, Git simply freezes the index copies into committed copies.
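To see the index entry and its blob directly, a small sketch like the following may help. It runs in a throwaway repository with a hypothetical file greeting.txt, and it compares the hash recorded in the index with the hash Git computes for the work-tree file.

```shell
#!/bin/sh
# Sketch: inspect the index entry and the blob that git add creates.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

printf 'hello\n' > greeting.txt              # hypothetical file
git add greeting.txt

# Each index entry lists: mode, blob hash ID, stage number, then a tab and the path.
git ls-files --stage greeting.txt

index_hash=$(git ls-files --stage greeting.txt | cut -d' ' -f2)
file_hash=$(git hash-object greeting.txt)    # hash Git computes for the work-tree content
blob_content=$(git cat-file -p "$index_hash")

echo "index: $index_hash"
echo "file:  $file_hash"
echo "blob content: $blob_content"
```

The two hashes match because nothing has changed in the work-tree since git add; editing greeting.txt without re-adding it would make them differ.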

When you run git checkout to switch to some commit, Git extracts the commit into the index (filling in all the blob hash IDs), and also extracts the blobs into the work-tree so that they are in useful format and you can work on them. When you run git add, Git compresses the work-tree file into its blob format and replaces the index entry for the file.

Transforming a blob into a work-tree file, or vice versa, is the ideal place where Git will do any conversions you need, such as turning newlines into CRLF line endings. So that's where Git does it: git checkout fills the index and expands-and-converts into the work-tree, and git add compresses-and-un-converts from the work-tree into the index, ready for the next git commit. (Any files you don't touch, stay compressed and ready to go, safely tucked away in the index.)
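The work-tree-to-blob direction of this conversion can be observed with a short experiment. The sketch below assumes a hypothetical file crlf.txt and a .gitattributes that marks *.txt as text; core.safecrlf is switched off for the one command only to keep the output quiet.

```shell
#!/bin/sh
# Sketch: git add un-converts CRLF to LF when the path has the text attribute.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

printf '*.txt text\n' > .gitattributes
printf 'one\r\ntwo\r\n' > crlf.txt           # CRLF line endings in the work-tree

# safecrlf is disabled here so Git does not warn about (or refuse) the conversion.
git -c core.safecrlf=false add crlf.txt

worktree_size=$(wc -c < crlf.txt | tr -d ' ')
blob_size=$(git cat-file -s :crlf.txt)       # size of the blob recorded in the index

echo "work-tree bytes: $worktree_size"
echo "blob bytes:      $blob_size"
```

The blob comes out two bytes smaller than the work-tree file, one byte for each carriage return that was removed on the way into the index.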

You already know that a tracked file is one that is in the index, and an untracked file is one that is in the work-tree but not in the index. Your goal is to use the existing .gitignore to make files that are currently in the index go away from the index if they would be .gitignore-ed. The process you are using is:

git rm -r --cached .: remove everything from the index, so that the entire work-tree is untracked
git add .: produce all new blobs in the index from whatever is in the work-tree, while ignoring any file that is listed in .gitignore.
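Before running this pair of commands, it can be useful to preview which tracked files match an ignore rule. The following is a sketch in a throwaway repository; keep.txt, build.log, and the demo identity are hypothetical.

```shell
#!/bin/sh
# Sketch: list tracked-but-ignored files, then drop them from the index.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"             # hypothetical identity
git config user.email "demo@example.invalid"

printf 'source\n' > keep.txt
printf 'noise\n'  > build.log
git add keep.txt build.log                   # build.log becomes tracked...
git commit -q -m "initial commit"
printf '*.log\n' > .gitignore                # ...and only later gets ignored

# Tracked files that match an ignore rule: these will leave the index.
ignored_but_tracked=$(git ls-files --cached --ignored --exclude-standard)
echo "will be untracked: $ignored_but_tracked"

git rm -r -q --cached .                      # empty the index
git add .                                    # re-add everything not ignored

after=$(git ls-files)
echo "$after"
```

The preview is read-only, so it is safe to run in a real repository on its own.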

The issue here is that what's in the work-tree has been converted by the "blob to work-tree" conversions, and will be "un-converted" by the "work-tree to blob" conversions. Creating a .gitattributes file with * -text tells Git: "The conversions to do are no conversions at all."
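One way to check that the directive is in effect is to ask Git which text attribute applies to a path and then compare sizes after adding. This is a sketch with a hypothetical file crlf.txt; core.autocrlf is forced on for the add only to show that the attribute takes precedence over it.

```shell
#!/bin/sh
# Sketch: with "* -text", git add stores the bytes unchanged.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

printf '* -text\n' > .gitattributes
printf 'one\r\ntwo\r\n' > crlf.txt           # 10 bytes, CRLF line endings

attr=$(git check-attr text -- crlf.txt)      # reports how the text attribute resolves
git -c core.autocrlf=true add crlf.txt       # autocrlf on, but -text overrides it

blob_size=$(git cat-file -s :crlf.txt)
echo "$attr"
echo "blob bytes: $blob_size"
```

If git check-attr reports "unspecified" instead of "unset" for a path in a real repository, the .gitattributes pattern is not matching that path.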

Unfortunately, it's too late: the git checkout you ran earlier, to get this commit into the work-tree, already did some conversions.

So here, we use step 1 to create a .gitattributes file that says do no conversions. Step 2, rm .git/index, removes the index entirely. Git now has no idea what's actually in the work-tree. This step may be unnecessary but I use it to force Git to act in step 3, which tells Git: extract every file from the HEAD commit into the index and the work-tree. This re-creates the index, and re-fills the work-tree, this time doing no conversions.

Steps 4 and 5 are just as before, but this time, the work-tree files all match the blobs in the HEAD commit since step 3 operated with the .gitattributes directive in place. Step 6 is to make sure you do not commit the "do no conversions" directive.

Thank you for the detailed explanation. It did not work, of course, but I did learn some things about git reading your answer. — Carl Shiles, Sep 12 '18 at 00:01
Interesting. What does Git think is different at this point? `git diff --cached` and `git diff` will compare the `HEAD` commit vs the index content, and the index vs the work-tree, respectively. — torek, Sep 12 '18 at 00:06
Ah, *git diff --cached* revealed what has got to be the problem: file permissions. Lines and lines of "old mode 100755, new mode 100644". I am using the git bash shell, if that makes any difference. — Carl Shiles, Sep 12 '18 at 00:10
Aha. I'm not sure if or how Windows stores "should / should not be executable" (in the work-tree) nor how Git checks these (but it has `core.filemode` which is `false` to mean "don't check"), but you can use `git update-index --chmod=` on these files. The setting is `+x` to go from `100644` to `100755` and `-x` to go the other way. — torek, Sep 12 '18 at 00:16
Yea - unfortunately there's upwards of fifty thousand files. I wasn't planning on writing a perl script today :( - I think that filemode is what I need actually.. (Ah, it is already set to false. Lovely....) — Carl Shiles, Sep 12 '18 at 00:18
You can automate it, if the modes are important: save the initial modes of all files somewhere (via `git ls-files -s > /tmp/whatever`), and then run `git update-index --chmod=+x` on all remaining files for which the original mode was `100755`. Easy enough in bash, `grep 100755`, use `cut` or `awk` to extract file names after the literal tab in the `ls-files` output, and run that through `sed` to build the commands to do the chmod's. A bit harder to toss names that don't exist but you can just chmod everything and ignore failures. — torek, Sep 12 '18 at 00:22
It's not that the modes are important, it's that core.fileMode is already set to false! This is ridiculous. — Carl Shiles, Sep 12 '18 at 00:57
see edit for a trick to save and then restore file modes (I've assumed that after `git add .` everything shows up as `100644`). — torek, Sep 12 '18 at 01:00
Keep in mind I'm on one of those dumb Windows machines.. I'm going to do a fresh clone on a *nix computer and see if I can get this done without all this headache. If needed, I will run your awk script. Thanks for all the help — Carl Shiles, Sep 12 '18 at 01:05
