Will a git interactive rebase that deletes a commit truly remove exposure of API keys / secrets / passwords?

Question

It's important to NOT store passwords and secrets in code repos.

Sometimes we hard-code an API password while we are developing an application. We remove it, often by turning it into an environment variable that we set with export (Unix). Obviously a better practice would be to use environment variables from the start.

But what happens when we are not that careful and we COMMIT a change that exposes the password?
The first step is to quickly remove it, then commit and push that change.
OK

But...

The password is still in the git history so anyone who has access to the git repository can get the pw. Not good.

But...

We then do a git interactive rebase and delete (not squash) the offending commit = the one with the password added in history.

Will that fix the problem and ensure the password is no longer available in any way in git?

How will this affect the code when I pull this commit out? If the commit contains other changes besides the line(s) with the password(s), presumably I will need to redo those changes, which would otherwise be lost. If the commit is many commits back, I could imagine problems if any commit since then has also changed the same line. Hopefully not.

Possible duplicate of [Remove sensitive files and their commits from Git history](https://stackoverflow.com/questions/872565/remove-sensitive-files-and-their-commits-from-git-history) — phd, May 29 '19 at 12:03
https://stackoverflow.com/search?q=%5Bgit%5D+remove+sensitive+files — phd, May 29 '19 at 12:03

score 6 · Accepted Answer · answered May 30 '19 at 00:40

There are lots of answers about how to remove sensitive commits, e.g., Remove sensitive files and their commits from Git history. Any good answer warns you that it's probably too late anyway, which is true. Not too many go into details about when and why it is too late, but the short version is pretty straightforward: once the commit has been copied anywhere else, removing it from your own repository is not a lot of use. The rest of this answer is about when and why it's too late, and why just deleting the commit with an interactive rebase is not sufficient.

The heart of the problem is that commits cannot be changed, and Git is wired to add new commits. Removing old / dead commits (and other dead objects) happens as a side effect, with little control on your part. When you do virtually anything—no matter what: git commit --amend, git rebase -i, git reset --hard, none of this matters—any existing commit remains in your database of commits, unchanged, undisturbed, and still available by its hash ID. Nonetheless, it is possible to remove a commit for real. It's just hard to do it in a controlled and correct manner.

Representing and finding commits

Each commit—in fact, each object¹ in Git's main database—is accessed by its hash ID. The hash ID of the last commit in a branch is found in a second, smaller database. Essentially, a branch name like master says: the tip commit of master is a123456..., which provides the hash ID of the commit object so that you—or Git—can go back to the main database and say: Get me object a123456....

Every commit can list the hash ID(s) of some previous, or parent, commits. That is, having obtained object a123456..., you can then fish around inside it for the parent hash IDs. If the (single) parent hash ID of a123456... is 9876543..., you then go back to the main database and say: Get me object 9876543... and you have the previous commit. That's how you (and Git) can start from the end of a branch and work backwards, one commit at a time:

... <-grandparent <-parent <-last-commit   <--branchname
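As a rough illustration of that backwards walk, here is a sketch in a throwaway repository. The file name, commit messages, and user identity are made up for the example; only the Git commands themselves matter.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"
git config commit.gpgsign false

# Three commits in a row: grandparent <- parent <- last-commit.
for n in 1 2 3; do
    echo "version $n" > file.txt
    git add file.txt
    git commit -q -m "commit $n"
done

# The branch name resolves to the hash ID of the tip commit.
tip=$(git rev-parse master)

# The raw commit object records its parent's hash ID on a "parent" line.
git cat-file -p "$tip"

# Follow the parent links backwards, one commit at a time.
parent=$(git rev-parse "$tip^")
grandparent=$(git rev-parse "$parent^")
echo "tip=$tip parent=$parent grandparent=$grandparent"
```

The `^` suffix is just shorthand for "look inside this commit and give me its first parent".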

If we use single uppercase letters to stand in for hash IDs, and just remember that the arrows (from child to parent) always point backwards, we get something that's easier to draw when you have multiple branches:

...--E--F--G   <-- master
         \
          H  <-- develop

But in all cases, whenever you do something to "change" your history—for instance, if we decide commit G is bad and must be replaced—you don't actually change anything. Instead, Git in effect just moves the bad commit out of the way:

          G
         /
...--E--F--I   <-- master
         \
          H  <-- develop

The main object database is not cleaned out immediately, and if you have any way to remember the hash ID of commit G, you can ask Git for G by that hash ID. Git will present it to you, because it's in the database!
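You can try this yourself in a scratch repository. The sketch below uses git commit --amend to replace G with I; the file app.conf and the fake password are invented for the demonstration.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"
git config commit.gpgsign false

echo "base" > app.conf
git add app.conf
git commit -q -m "F: base config"

# Commit G accidentally contains a (fake) secret.
echo "password=hunter2" >> app.conf
git commit -q -am "G: add password"
G=$(git rev-parse HEAD)

# "Replace" G: --amend builds a new commit I on top of F and moves master to it.
printf 'base\npassword_env=API_PASSWORD\n' > app.conf
git commit -q --amend -am "I: read password from environment"
I=$(git rev-parse HEAD)

# master now leads to I and F only ...
git log --oneline master
# ... but G is still in the object database, secret included.
git show "$G:app.conf"
```

The last command prints the file as it was in G, password and all, even though no branch contains G any more.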

This same description is true regardless of how you "delete" or "change" a commit: Git just ends up making copies of every other commit, so that the "deleted" or "changed" commit (here, G is to be deleted) is now on a different branch-line:

...--o--F--G--H--J--...   <-- branch

becomes:

          G--H--J--...   [previous branch, now abandoned]
         /
...--o--F--H'-J'-...   <-- branch

where H' is a copy of H adapted to come after F instead of G, J' is a copy of J adapted to come after H', and so on. Again, G isn't really gone, it's just shoved up out of the way, along with all of its descendants. All of its descendants are replaced by slightly-altered copies, with new, different hash IDs.
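Here is a sketch of that picture. To keep it non-interactive, it uses git rebase --onto, which should have the same effect as deleting the "pick G" line in an interactive rebase started at F. The file names and contents are invented for the demo.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"
git config commit.gpgsign false

# Small helper (hypothetical): add one file in its own commit.
commit_file() {
    echo "$2" > "$1"
    git add "$1"
    git commit -q -m "$3"
}

commit_file f.txt "f" "F";                          F=$(git rev-parse HEAD)
commit_file secret.txt "password=hunter2" "G";      G=$(git rev-parse HEAD)
commit_file h.txt "h" "H";                          H=$(git rev-parse HEAD)
commit_file j.txt "j" "J";                          J=$(git rev-parse HEAD)

# Replay everything after G (that is, H and J) on top of F, dropping G.
git rebase -q --onto "$F" "$G" master

H2=$(git rev-parse master~1)   # H', the copy of H
J2=$(git rev-parse master)     # J', the copy of J

echo "H=$H  H'=$H2"
echo "J=$J  J'=$J2"

# The abandoned chain is intact: the old J still leads back to G.
git log --oneline "$J"
```

The copies have new hash IDs, and the original G, H, and J remain in the database under their old ones.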

¹There are four types of objects. Commit, tree, and blob objects work together to store files in commits, with annotated tag objects making the fourth type. Each commit refers to one tree; that tree refers to additional sub-trees if needed, and to blobs to hold the files that go with that commit.

Removing commits

So, when—and how and why—do commits eventually go away? The answer is that Git has a maintenance command, git gc, whose job is to walk the entire main database of every object, while also walking the other database of all names by which one can find objects. If there is no name by which we can find commit G, after an operation like the above, git gc will determine that this is the case, and—eventually—kick G out of the main database, using whatever the operating system's normal deletion functions are for deleting a file.²

More formally, for git gc to delete an object from the main database, the object must be unreachable. For a pretty thorough discussion of the notion of reachability, see Think Like (a) Git. Unfortunately for your particular use case, the set of names by which we can reach commits includes every entry in every reflog.

²Typically this is an insecure delete, so that if you have control of the underlying storage media, you can still get the data back that way, but now it's obviously that much harder. In any case, now no one can just ask that Git repository for commit G by hash ID. Beware of file systems that support snapshots, though: you can just wind back to a previous snapshot and recover the entire repository as it was at the time of the snapshot!

Finding commits part 2: the reflogs

There is a reflog for every branch name, such as master, plus one for HEAD. (There are probably additional reflogs, but these are the two important ones here.) In the example above, commit G is no longer reachable from the name master, but there are still two reflog entries, master@{1} and HEAD@{1}, both of which serve to find commit G. So git gc will not delete commit G, at least not yet.
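To see those two reflog entries in action, here is a small sketch, again in a scratch repository with an invented file and a fake password:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"
git config commit.gpgsign false

echo "base" > app.conf
git add app.conf
git commit -q -m "F"

echo "password=hunter2" >> app.conf
git commit -q -am "G: add password"
G=$(git rev-parse HEAD)

# Replace G with I, as in the diagram.
printf 'base\npassword_env=API_PASSWORD\n' > app.conf
git commit -q --amend -am "I"

# Both reflogs remember where the name pointed one step ago: commit G.
git reflog show master
via_master=$(git rev-parse 'master@{1}')
via_head=$(git rev-parse 'HEAD@{1}')
echo "G=$G master@{1}=$via_master HEAD@{1}=$via_head"
```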

The reflog entries that do find G will be deleted, eventually. In particular, git reflog expire automatically deletes sufficiently-old and therefore expired reflog entries. How old is sufficiently old is something you can configure, but it defaults to either 30 or 90 days,³ and in this case, to 30 days.

What this means is that by default, G will stick around until a git gc uses git reflog to delete the reflog entries, once they're sufficiently old—i.e., at least 30 days from now. You can use git reflog (see the documentation) to delete or expire the entries for G sooner, if you want to speed that part up; or see cloning below.

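Once the reflog entries are gone, so that G really is (globally) unreachable, git gc will remove it. You can tell that this has happened because git show hash and git rev-parse --verify hash^{commit} will tell you that they have no idea what hash ID you are talking about. (Given a full 40-character hash ID, a plain git rev-parse hash just echoes the ID back without checking that the object exists.)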
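A sketch of the impatient version, for a repository that has never been pushed or fetched from: expire every reflog entry right away, then run git gc with --prune=now, which overrides the default grace period for unreachable objects (gc.pruneExpire, two weeks). Note that this throws away all reflog history, not just the entries for G. The file name and fake password are invented for the demo.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"
git config commit.gpgsign false

echo "base" > app.conf
git add app.conf
git commit -q -m "F"

echo "password=hunter2" >> app.conf
git commit -q -am "G: add password"
G=$(git rev-parse HEAD)
secret_blob=$(git rev-parse "$G:app.conf")   # the file content holding the secret

printf 'base\npassword_env=API_PASSWORD\n' > app.conf
git commit -q --amend -am "I"
I=$(git rev-parse HEAD)

# Drop every reflog entry now, then delete all unreachable objects now.
git reflog expire --expire=now --expire-unreachable=now --all
git gc -q --prune=now

# cat-file -e exits non-zero when the object does not exist.
if git cat-file -e "$G^{commit}" 2>/dev/null; then
    echo "G is still in the database"
else
    echo "G is gone"
fi
```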

Remember, too, that if your Git has contacted another Git, your Git may have given that other Git commit G. In particular, when you run git push, you have your Git call up another Git and feed them commits. If you've given them commit G, nothing you do in your own repository can take that back. If you allow other users to git fetch from your repository, they may have taken a copy of G, and again, nothing you do in your own repository can take that back: you must convince them to discard the commit.
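The sketch below shows how little a force-push undoes. It stands in a local bare repository (the hypothetical server.git) for the hosting server; a real server may behave differently in detail, but the principle is the same.

```shell
set -eu
work=$(mktemp -d)
git init -q --bare "$work/server.git"     # stand-in for the remote server

git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"
git config commit.gpgsign false
git remote add origin "$work/server.git"

echo "base" > app.conf
git add app.conf
git commit -q -m "F"

echo "password=hunter2" >> app.conf
git commit -q -am "G: add password"
G=$(git rev-parse HEAD)
git push -q origin master                 # G has now left your repository

# Replace G locally and overwrite the server's branch.
printf 'base\npassword_env=API_PASSWORD\n' > app.conf
git commit -q --amend -am "I"
I=$(git rev-parse HEAD)
git push -q --force origin master

# The server's master points to I, yet its object database still holds G.
if git --git-dir="$work/server.git" cat-file -e "$G^{commit}"; then
    echo "server still stores G"
fi
```

Anyone who fetched between the two pushes has their own copy of G as well, and no push of yours can reach into their repository.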

Reflogs are not copied by git clone, so another way to get rid of G, without waiting, is to clone your own repository. What git clone does is to make a new repository, then fetch from the original repository. The commits a fetch gets are those that are reachable from the names the source repository exposes. (When cloning from a plain local path, use --no-local or a file:// URL: by default such a clone copies or hard-links the whole objects directory, unreachable objects included.) So, rather than manually expiring some reflog entries and then running git gc, you can just clone your own repository. There's a drawback here: you lose the safety net of all of your reflogs, and your own branch names become your new repository's origin/* names.⁴
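A sketch of the clone route, with invented directory names src and copy. It passes --no-local so that the clone goes through the normal fetch machinery rather than copying the objects directory wholesale.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/src"
cd "$work/src"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"
git config commit.gpgsign false

echo "base" > app.conf
git add app.conf
git commit -q -m "F"

echo "password=hunter2" >> app.conf
git commit -q -am "G: add password"
G=$(git rev-parse HEAD)

printf 'base\npassword_env=API_PASSWORD\n' > app.conf
git commit -q --amend -am "I"
I=$(git rev-parse HEAD)

# Clone via the regular transport: only commits reachable from branch
# and tag names are transferred, and reflogs are not copied.
git clone -q --no-local "$work/src" "$work/copy"

if git -C "$work/copy" cat-file -e "$G^{commit}" 2>/dev/null; then
    echo "copy contains G"
else
    echo "copy does not contain G"
fi
```

The original src repository still has G, of course, so this only helps if you then discard the original and keep working in the clone.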

³The choice between 30 and 90 days here depends on whether the value in the reflog is reachable from the commit to which the reference itself points. In this case, the name master points to commit I, for instance, and it's not possible to walk back from I to G, so the value in master@{1}, which points to G, is not reachable from the value on master. This means the expiration is gc.reflogExpireUnreachable—the one that defaults to 30 days—rather than gc.reflogExpire, which defaults to 90 days.

Note that once again, we depend on the concept of reachability through a directed graph. This is one of the keys to understanding Git.

⁴You can use git clone --mirror, but that gets you a bare repository, and one with an inappropriate default fetch setting. You can then fix those two, but if you know how to do all this, you'll probably want to use something other than --mirror anyway.

Summary

If:

- you haven't shared the unwanted commits with anyone (no fetches or pushes), and
- you remove all references to the commits, or wait 30 days, and then run git gc

then the commit will be truly gone, absent any sort of resurrection through file-system level snapshots. You can feed the hash ID to git show or git cat-file -e to verify that it is gone. But if the commit might have been copied anywhere else, you no longer have any control over that.

The safe default is to assume that if the commit was visible to anyone else for any period of time, it has been copied, and the secrets that were in it are no longer secret.
