How to retroactively and completely remove traces of files and folders that are added to .gitignore

Question

Please note: I have read this, this, this, and many more. They either don't answer exactly my question, or I'm not experienced enough to extract my solution from them.

I have mistakenly committed sensitive information to my local git repo. Now, I've added the concerning file and folders to .gitignore. How do I remove any and every trace of these files from the repo?

I have a huge project where some sensitive information is kept in different folders across the project. Out of ignorance, I didn't add these folders to .gitignore. Now that I have done so, how can I make sure that all of these files are completely removed from git history?

The concerning files and folders follow a similar pattern, if that's of any help.

I have also done many commits since I started this project.

The concerning folders look like this in my .gitignore:

js/*/sensitiveData
python/*/sensitiveData

Is there a way to remove them while preserving the rest of the git history?

I would ideally remove all these folders/files that I added to .gitignore from git history while preserving them on my local disk and keeping my git commits.

If it's of any help, I don't have any remotes, yet. Everything is kept on my local disk.

torek · Accepted Answer · 2019-08-06T20:47:31.343

See Remove sensitive files and their commits from Git history, but—this is very important—your problem is simpler, because:

"If it's of any help, I don't have any remotes, yet. Everything is kept on my local disk."

This is indeed very helpful. What you are going to do—what you must do, no matter which way you choose to do it—is to "rewrite history". History, in Git, is nothing more than the set of commits in the Git repository. Each commit saves a full and complete snapshot of every file,¹ plus some metadata like who made the commit (name and email), when (date-and-time-stamp), and why (log message). One part of the metadata specifies which commit is the previous commit: the immediate history for this one commit.

History just means: start at (all of) the last commit(s), and work backwards from each point to its previous (parent) commit(s). That's it—that's all there is to it, really. But, every commit is frozen forever: you cannot change which files it has, nor which parent commit(s) it identifies. So to "change history" you must construct a whole new history, starting from whichever commit(s) have the files you don't want them to have. From then on, every descendant has to change too: to not have the file(s), and/or to list as their immediate history, the commit(s) that don't have the files.

In a big repository with a lot of commits, this tends to amount to: Copy every commit to a new and improved commit. Then you simply switch from using the old commits to using the new ones. The old ones, being un-find-able, are eventually² cleaned up and really do go away. In the meantime, you just carry around double copies of everything—which, because of the way Git stores files, doesn't really take much extra space.
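The parent links described above can be seen directly in a throwaway repository. This is only a sketch: the directory, file name, and commit messages are made up for illustration and have nothing to do with your project.

```shell
set -eu
# Build a throwaway repository with three commits (all names are illustrative).
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"
for n in 1 2 3; do
    echo "version $n" > file.txt
    git add file.txt
    git commit -q -m "commit $n"
done

# Each output line is "<commit> <parent>"; the first commit has no parent.
git rev-list --parents HEAD

# The newest commit's metadata names its snapshot (tree) and its parent.
git cat-file -p HEAD

# The parent recorded in the newest commit is the previous commit.
tip_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
second=$(git rev-parse HEAD~1)
echo "recorded parent: $tip_parent"
echo "previous commit: $second"
```

Because the parent ID is part of each commit, replacing one commit with a new one forces every later commit to be copied as well, which is why the rewrite touches the whole chain.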

Next, although I've never actually used The BFG, I recommend considering this answer to the linked question.

Last, no matter which of the various approaches you use from Remove sensitive files and their commits from Git history, I'd recommend that you do it this way:

1. Copy your repository (see below for copying methods).
2. Apply your chosen "rewrite history" method to the copy.
3. Inspect the result. Is it good? If so, switch to using the copy. If not, remove the copy and start again at step 1.

If your chosen method is git filter-branch, the copy in step 1 is not actually necessary. It just makes it a lot easier for those not super-familiar with Git, because if you didn't modify the original, you can feel pretty safe just removing the attempt. The original is still there, intact.
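As a rough illustration of the three steps with git filter-branch, here is a sketch that runs entirely in a temporary directory. The repository layout, file names, and pathspec patterns are assumptions modelled on the .gitignore lines in the question. Adapt and test them on a copy before trusting them with real data.

```shell
set -eu
# Throwaway "original" repository; every path and name here is hypothetical.
work=$(mktemp -d)
cd "$work"
git init -q original
cd original
git config user.name "Demo User"
git config user.email "demo@example.com"
mkdir -p js/app/sensitiveData python/tool/sensitiveData
echo "console.log('hi')" > js/app/main.js
echo "api-key" > js/app/sensitiveData/keys.txt
echo "password" > python/tool/sensitiveData/creds.txt
git add .
git commit -q -m "initial commit, secrets included by mistake"
echo "// more code" >> js/app/main.js
git commit -q -a -m "later work"

# Step 1: copy the repository, work-tree and .git included.
cd "$work"
cp -r original copy
cd copy

# Step 2: rewrite every commit on every ref, dropping the sensitive paths
# from each snapshot. In a pathspec, * also matches "/", so these patterns
# cover everything below any sensitiveData directory under js/ or python/.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
    --index-filter \
    "git rm -r -q --cached --ignore-unmatch 'js/*/sensitiveData/*' 'python/*/sensitiveData/*'" \
    -- --all

# Step 3: inspect. List every path recorded in any commit reachable from a
# branch, and look for leftovers. (--all would also walk the backup refs
# that filter-branch keeps under refs/original/.)
remaining=$(git log --branches --pretty=format: --name-only | grep sensitiveData || true)
echo "sensitive paths left in rewritten history: ${remaining:-none}"
```

Two things to keep in mind. The old commits are still present in the copy, through refs/original/ and the reflogs, until those are removed and the repository is garbage-collected. Also, filter-branch updates the copy's work-tree to match the rewritten branch tip, so files that were still tracked may disappear from the copy. The original directory is untouched and still has them.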

¹Obviously, each commit really only saves a full and complete copy of every file that you saved with that commit. But that's all of your files from the last commit, plus any you added, minus any you explicitly removed.

The reason this doesn't make your repository grow immensely fat nearly instantly is that the frozen, compressed copy of a file in some previous commit can be—and is—reused in any later commit that uses the same data. This is entirely safe because all commits are frozen for all time. At most, the commit itself can be forgotten, and then eventually deleted: if some of its files are still in use by some other commit, the file data remains. The file data only goes away if no commit is using it.
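The reuse of file data can be checked by comparing the object IDs that two commits record for the same path. The following is a minimal sketch in a throwaway repository. The file names are invented for the example.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"

# One file stays the same across both commits, the other changes.
echo "unchanged contents" > stable.txt
echo "v1" > changing.txt
git add .
git commit -q -m "first"
echo "v2" > changing.txt
git commit -q -a -m "second"

# Object IDs of the stored file data in each commit.
stable_old=$(git rev-parse HEAD~1:stable.txt)
stable_new=$(git rev-parse HEAD:stable.txt)
changing_old=$(git rev-parse HEAD~1:changing.txt)
changing_new=$(git rev-parse HEAD:changing.txt)

echo "stable.txt:   $stable_old -> $stable_new"
echo "changing.txt: $changing_old -> $changing_new"
```

Both commits point at the same object for stable.txt, so its data is stored once. Only changing.txt gets a new object.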

²The "eventual" is based on both hidden references to commits, which are kept in each repository's reflogs, and the background cleaning process. The background cleaner only fires up when it looks, at a quick glance, profitable to do so. You can force a cleaning by running git gc yourself. The cleaner will find all references—including all hidden ones—to see which commits need to be kept, and which files are used by those to-keep commits. Commits and files and other internal objects that aren't needed any more, and are at least some particular age—14 days old by default—can then be removed for real.
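If you would rather not wait for the grace periods, the reflog entries can be expired and the prune forced by hand. The sketch below does this in a throwaway repository with an invented scratch branch. It is destructive by design, so try it on a copy first. After a filter-branch run you would also need to delete the backup refs under refs/original/ before the old commits become unreferenced.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "keep" > file.txt
git add file.txt
git commit -q -m "kept commit"

# Make a commit on a scratch branch, then delete that branch. No branch
# refers to the commit any more, but HEAD's reflog still does.
git checkout -q -b scratch
echo "oops" > file.txt
git commit -q -a -m "commit to abandon"
abandoned=$(git rev-parse HEAD)
git checkout -q -
git branch -q -D scratch
git cat-file -e "$abandoned" && echo "still present after deleting the branch"

# Drop the hidden references, then prune right away instead of waiting
# for the default expiry times.
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$abandoned" 2>/dev/null; then
    status="present"
else
    status="gone"
fi
echo "abandoned commit is now: $status"
```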

Copying a repository

The simplest method is to use whatever file-tree-duplicator your system has, to copy the entire work-tree including the .git directory / folder:

cd $HOME/src
cp -r original copy

for instance. That works fine, with Git, although it also copies any random stuff that's not technically part of the repository. Note: If you have used git worktree add, it doesn't copy the added work-trees that live outside the original/ area, but neither does the other technique I'm about to show.

The other method is to use the fact that every clone of a repository is itself a repository. The tricky part here is that a clone differs from the original in a few ways:

By default, none of the remote-tracking names of the original repository wind up in the clone. None of the remotes do either, so there's no sense in copying such names. You have no remotes, so this is irrelevant.
By default, the new clone has the original repository as its one and only remote. This remote is named origin. That's fine, you can remove this origin later if you want.
By default, the new clone renames all of the branches from the original repository. If the original repository has branches B1, B2, B3, and master, the new clone has origin/B1, origin/B2, origin/B3, and origin/master as its remote-tracking names.

A remote-tracking name is just Git's way of remembering: I saw this branch on some other Git! The last time I saw it, it said to use commit _____ (fill in the blank based on what this Git saw from the origin Git).

So, if you do:

git clone file://$HOME/src/original copy

then your new copy in ./copy has file://$HOME/src/original as the URL stored in its origin, and has renamed your branches from original to origin/* in copy.

The last step of the clone is to git checkout master, so that the copy now has its own master, but doesn't have its own B1, B2, and B3. So before you rewrite history in the copy, you'll want to create the branches.

You can do this pretty simply, manually, by just running:

git checkout B1
git checkout B2
git checkout B3

These commands use the same mechanism that git clone used to make master in copy based on copy's origin/master that copy got from origin (i.e., the original repository). So, now, your copy has four branches, just like your original.

(If you have a lot of branches, and need to do this often, you'll want to script it instead. But if you need to do this often, you're doing something wrong in the first place. :-) )
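For what it's worth, one way to script the branch creation is sketched below. It builds a throwaway original with the branch names used above, clones it, and then loops over the origin/* names. It uses git branch --track rather than git checkout, so it does not switch branches along the way. Treat the loop as an illustration under those assumptions, not as the one right way to do it.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q original
cd original
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "content" > file.txt
git add file.txt
git commit -q -m "initial"
git branch B1
git branch B2
git branch B3

cd "$work"
git clone -q "file://$work/original" copy
cd copy

# Right after cloning, only master exists as a local branch; the others
# are visible only as origin/* remote-tracking names.
git branch -r

# Create a local branch for every origin/* name that does not have one yet.
git for-each-ref --format='%(refname:strip=3)' refs/remotes/origin |
while read -r name; do
    case "$name" in HEAD) continue ;; esac
    git show-ref --verify --quiet "refs/heads/$name" ||
        git branch -q --track "$name" "origin/$name"
done

branch_count=$(git for-each-ref refs/heads | wc -l | tr -d ' ')
echo "local branches in copy: $branch_count"
```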

thank you for your thorough answer @torek will check this. My rep is still too low so my upvotes don't count. I'll accept the answer after testing. Cheers! — user1984, Aug 08 '19 at 07:14
