There are lots of answers about how to remove sensitive commits, e.g., Remove sensitive files and their commits from Git history. Any good answer warns you that it's probably too late anyway, which is true. Not many go into detail about when and why it is too late, though. The rest of this answer is about when and why it's too late, and why just deleting the commit with an interactive rebase is not sufficient.
The heart of the problem is that commits cannot be changed, and Git is wired to add new commits. Removing old / dead commits (and other dead objects) happens as a side effect, with little control on your part. When you do virtually anything—no matter what: git commit --amend
, git rebase -i
, git reset --hard
, none of this matters—any existing commit remains in your database of commits, unchanged, undisturbed, and still available by its hash ID. Nonetheless, it is possible to remove a commit for real. It's just hard to do it in a controlled and correct manner.
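To make this concrete, here is a minimal sketch, assuming a throwaway repository created just for the demonstration. The file name config.txt and the "password" are made up. It amends a commit and then shows that the original commit, secret included, can still be read back by its hash ID.

```shell
set -eu
# Throwaway repository; the file name and the "password" are made up.
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master   # call the first branch master on any Git version
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo 'password=hunter2' > config.txt
git add config.txt
git commit -q -m 'add config'
old=$(git rev-parse HEAD)          # hash ID of the commit holding the secret

echo 'password=REDACTED' > config.txt
git commit -q -a --amend -m 'add config'
new=$(git rev-parse HEAD)          # the replacement commit has a different hash ID

# The "amended away" commit is still in the object database, secret and all.
old_type=$(git cat-file -t "$old")
old_content=$(git show "$old:config.txt")
echo "old=$old ($old_type) still contains: $old_content"
```

The branch name now finds only the new commit, but anyone who knows (or can dig up) the old hash ID can still retrieve the old one.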
Representing and finding commits
Each commit—in fact, each object1 in Git's main database—is accessed by its hash ID. The hash ID of the last commit in a branch is found in a second, smaller database. Essentially, a branch name like master
says: the tip commit of master
is a123456...
, which provides the hash ID of the commit object so that you—or Git—can go back to the main database and say: Get me object a123456...
.
Every commit can list the hash ID(s) of some previous, or parent, commits. That is, having obtained object a123456...
, you can then fish around inside it for the parent hash IDs. If the (single) parent hash ID of a123456...
is 9876543...
, you then go back to the main database and say: Get me object 9876543...
and you have the previous commit. That's how you, and Git, can start from the end of a branch and work backwards, one commit at a time:
... <-grandparent <-parent <-last-commit <--branchname
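The backwards walk can be sketched with plumbing commands. This is only an illustration in a scratch repository, with three empty commits whose messages mirror the drawing above. It reads the parent hash ID out of the tip commit, then follows first parents until it runs out of commits.

```shell
set -eu
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

for msg in grandparent parent last-commit; do
    git commit -q --allow-empty -m "$msg"
done

# The branch name yields the tip hash ID; the commit object itself lists its parent.
tip=$(git rev-parse master)
parent_in_object=$(git cat-file -p "$tip" | sed -n 's/^parent //p')

# Walk backwards, one commit at a time, until a commit has no parent.
commit=$tip
walk=""
while [ -n "$commit" ]; do
    subject=$(git log -1 --format=%s "$commit")
    walk="$walk$subject "
    echo "$commit $subject"
    commit=$(git rev-parse -q --verify "$commit^" || true)   # empty at the root commit
done
```

This is, in effect, what git log does for you: start at the name, then follow the stored parent hash IDs.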
If we use single uppercase letters to stand in for hash IDs, and just remember that the arrows (from child to parent) always point backwards, we get something that's easier to draw when you have multiple branches:
...--E--F--G   <-- master
         \
          H   <-- develop
But in all cases, whenever you do something to "change" your history—for instance, if we decide commit G
is bad and must be replaced—you don't actually change anything. Instead, Git in effect just moves the bad commit out of the way:
          G
         /
...--E--F--I   <-- master
         \
          H   <-- develop
The main object database is not cleaned out immediately, and if you have any way to remember the hash ID of commit G
, you can ask Git for G
by that hash ID. Git will present it to you, because it's in the database!
This same description is true regardless of how you "delete" or "change" a commit: Git just ends up making copies of every commit that comes after it, so that the "deleted" or "changed" commit (here, G
is to be deleted) is now on a different branch-line:
...--o--F--G--H--J--... <-- branch
becomes:
          G--H--J--...   [previous branch, now abandoned]
         /
...--o--F--H'-J'-...   <-- branch
where H'
is a copy of H
adapted to come after F
instead of G
, J'
is a copy of J
adapted to come after H'
, and so on. Again, G
isn't really gone, it's just shoved up out of the way, along with all of its descendants. All of its descendants are replaced by slightly-altered copies, with new, different hash IDs.
1There are four types of objects. Commit, tree, and blob objects work together to store files in commits, with annotated tag objects making the fourth type. Each commit refers to one tree; that tree refers to additional sub-trees if needed, and to blobs to hold the files that go with that commit.
Removing commits
So, when—and how and why—do commits eventually go away? The answer is that Git has a maintenance command, git gc
, whose job is to walk the entire main database of every object, while also walking the other database of all names by which one can find objects. If there is no name by which we can find commit G
, after an operation like the above, git gc
will determine that this is the case, and—eventually—kick G
out of the main database, using whatever the operating system's normal deletion functions are for deleting a file.2
More formally, for git gc
to delete an object from the main database, the object must be unreachable. For a pretty thorough discussion of the notion of reachability, see Think Like (a) Git. Unfortunately for your particular use case, the set of names by which we can reach commits includes every entry in every reflog.
2Typically this is an insecure delete, so that if you have control of the underlying storage media, you can still get the data back that way, but now it's obviously that much harder. In any case, now no one can just ask that Git repository for commit G
by hash ID. Beware of file systems that support snapshots, though: you can just wind back to a previous snapshot and recover the entire repository as it was at the time of the snapshot!
Finding commits part 2: the reflogs
There is a reflog for every branch name, such as master
, plus one for HEAD
. (There are probably additional reflogs, but these are the two important ones here.) In the example above, commit G
is no longer reachable from the name master
, but there are still two reflog entries, master@{1}
and HEAD@{1}
, both of which serve to find commit G
. So git gc
will not delete commit G—not yet, anyway.
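Here is a small sketch of that situation, again in a scratch repository with empty commits named after the drawing. It replaces G with I via git commit --amend, then checks that both reflog names still resolve to G, and that git fsck reports G as unreachable only when told to ignore reflogs.

```shell
set -eu
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git commit -q --allow-empty -m F
git commit -q --allow-empty -m G
G=$(git rev-parse HEAD)
git commit -q --allow-empty --amend -m I     # master now points to I instead of G

via_master=$(git rev-parse 'master@{1}')     # previous value of master
via_head=$(git rev-parse 'HEAD@{1}')         # previous value of HEAD

# By default fsck counts reflog entries as starting points; --no-reflogs ignores them.
with_reflogs=$(git fsck --unreachable 2>/dev/null | grep -c "$G" || true)
without_reflogs=$(git fsck --unreachable --no-reflogs 2>/dev/null | grep -c "$G" || true)
echo "master@{1}=$via_master HEAD@{1}=$via_head"
echo "listed as unreachable: with reflogs=$with_reflogs, ignoring reflogs=$without_reflogs"
```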
The reflog entries that do find G
will be deleted, eventually. In particular, git reflog expire
automatically deletes sufficiently-old and therefore expired reflog entries. How old is sufficiently old is something you can configure, but it defaults to either 30 or 90 days,3 and in this case, to 30 days.
What this means is that by default, G
will stick around until a git gc
uses git reflog expire
to delete the reflog entries, once they're sufficiently old—i.e., at least 30 days from now. You can use git reflog
(see the documentation) to delete or expire the entries for G
sooner, if you want to speed that part up; or see cloning below.
Once the reflog entries are gone, so that G
really is (globally) unreachable, git gc
will remove it. You can tell that this has happened because git show hash
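and git rev-parse --verify hash^{commit}
will tell you that they have no idea what hash ID you are talking about. (A plain git rev-parse given a full 40-character hash just echoes it back without checking the database, hence the --verify and the ^{commit} suffix.)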
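Speeding this up by hand might look like the sketch below, run in a scratch repository with made-up file contents. One detail it relies on: a plain git gc also keeps recently created unreachable loose objects for a grace period (gc.pruneExpire, two weeks by default), so the sketch passes --prune=now to skip that wait. Treat it as an illustration, not a recipe to paste into a repository you care about.

```shell
set -eu
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo 'public' > readme.txt
git add readme.txt
git commit -q -m F
echo 'password=hunter2' > config.txt
git add config.txt
git commit -q -m G
G=$(git rev-parse HEAD)
echo 'password=REDACTED' > config.txt
git commit -q -a --amend -m I                # G is replaced by I

if git cat-file -e "$G^{commit}" 2>/dev/null; then before=present; else before=gone; fi

# Drop reflog entries whose commits are not reachable from the branch tips ...
git reflog expire --expire-unreachable=now --all
# ... then collect garbage without the usual grace period.
git gc -q --prune=now

if git cat-file -e "$G^{commit}" 2>/dev/null; then after=present; else after=gone; fi
echo "before: $before, after: $after"
```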
Remember, too, that if your Git has contacted another Git, your Git may have given that other Git commit G
. In particular, when you run git push
, you have your Git call up another Git and feed them commits. If you've given them commit G
, nothing you do in your own repository can take that back. If you allow other users to git fetch
from your repository, they may have taken a copy of G
, and again, nothing you do in your own repository can take that back: you must convince them to discard the commit.
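The following sketch illustrates the point with two scratch repositories, src and other (both names are arbitrary). The second one is cloned while G is still the branch tip. Scrubbing G out of src afterwards has no effect on the copy.

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q src
cd src
git symbolic-ref HEAD refs/heads/master
echo 'password=hunter2' > config.txt
git add config.txt
git commit -q -m G
G=$(git rev-parse HEAD)

cd "$tmp"
git clone -q --no-local src other            # someone else obtains G

cd src                                       # now scrub G from the original
echo 'password=REDACTED' > config.txt
git commit -q -a --amend -m I
git reflog expire --expire-unreachable=now --all
git gc -q --prune=now
cd "$tmp"

if git -C src cat-file -e "$G^{commit}" 2>/dev/null; then in_src=present; else in_src=gone; fi
if git -C other cat-file -e "$G^{commit}" 2>/dev/null; then in_other=present; else in_other=gone; fi
leaked=$(git -C other show "$G:config.txt")
echo "src: $in_src, other: $in_other, other still reads: $leaked"
```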
Reflogs are not copied by git clone
, so another way to get rid of G
, without waiting, is to clone your own repository. What git clone
does is to make a new repository, then fetch from the original repository. The commits a fetch gets are those that are reachable from the names the source repository exposes. So, rather than manually expiring some reflog entries and then running git gc
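, you can just clone your own repository. (When cloning from a local path, use a file:// URL or git clone --no-local: a plain local-path clone copies or hard-links everything in the objects directory, unreachable objects included.) There's a drawback here: you lose the safety net of all of your reflogs, and your own branch names become your new repository's origin/*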
names.4
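A sketch of the clone approach, using scratch repositories named src and clean (arbitrary names). It passes --no-local so that the clone goes through the regular fetch machinery even though the source is a local path. Without it, Git's local-clone shortcut copies the object directory wholesale.

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q src
cd src
git symbolic-ref HEAD refs/heads/master
echo 'public' > readme.txt
git add readme.txt
git commit -q -m F
echo 'password=hunter2' > config.txt
git add config.txt
git commit -q -m G
G=$(git rev-parse HEAD)
echo 'password=REDACTED' > config.txt
git commit -q -a --amend -m I                # src still holds G via its reflogs
I=$(git rev-parse HEAD)

cd "$tmp"
git clone -q --no-local src clean            # fetches only what src's names can reach

if git -C src cat-file -e "$G^{commit}" 2>/dev/null; then in_src=present; else in_src=gone; fi
if git -C clean cat-file -e "$G^{commit}" 2>/dev/null; then in_clean=present; else in_clean=gone; fi
clean_tip=$(git -C clean rev-parse origin/master)   # src's master, seen as origin/master
echo "src: $in_src, clean: $in_clean"
```

The original repository still has G until its own reflog entries expire, so the clone only helps if you then discard (or clean up) the original.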
3The choice between 30 and 90 days here depends on whether the value in the reflog is reachable from the commit to which the reference itself points. In this case, the name master
points to commit I
, for instance, and it's not possible to walk back from I
to G
, so the value in master@{1}
, which points to G
, is not reachable from the value on master
. This means the expiration is gc.reflogExpireUnreachable
—the one that defaults to 30 days—rather than gc.reflogExpire
, which defaults to 90 days.
Note that once again, we depend on the concept of reachability through a directed graph. This is one of the keys to understanding Git.
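If you want different windows, both settings can be set per repository. The values below are arbitrary examples, not recommendations.

```shell
set -eu
cd "$(mktemp -d)"
git init -q .
# Both keys are unset in a fresh repository, in which case Git uses its built-in defaults.
git config gc.reflogExpire '90 days'
git config gc.reflogExpireUnreachable '7 days'   # example value only
reachable_window=$(git config --get gc.reflogExpire)
unreachable_window=$(git config --get gc.reflogExpireUnreachable)
echo "reachable: $reachable_window, unreachable: $unreachable_window"
```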
4You can use git clone --mirror
, but that gets you a bare repository, and one with an inappropriate default fetch
setting. You can then fix those two, but if you know how to do all this, you'll probably want to use something other than --mirror
anyway.
Summary
If:
- you haven't shared the unwanted commits with anyone (no fetches or pushes), and
- you remove all references to the commits, or wait 30 days, and then run
git gc
then the commit will be truly gone, absent any sort of resurrection through file-system level snapshots. You can feed the hash ID to git show
or git rev-parse --verify (with ^{commit} appended to the hash)
to verify that it is gone. But if the commit might have been copied anywhere else, you no longer have any control over that.
The safe default is to assume that if the commit was visible to anyone else for any period of time, it has been copied, and the secrets that were in it are no longer secret.