How to remove user sensitive data from GitHub

Question

I'm following the GitHub article "Removing sensitive data from a repository" in order to remove some sensitive data from a GitHub repo, but I don't know how to "force push" ALL the changes that I have made locally to GitHub. Let me explain in more detail:

I created a test repo and committed some fake sensitive data, a file named fake_sensitive_data.txt that lives in the root of the project.
I started committing more files to the repo
I created a commit to remove the sensitive data from the repo
I cloned the project in a different folder
In the new cloned folder I removed fake_sensitive_data.txt from git history using the command bfg --delete-files fake_sensitive_data.txt:


Using repo : git-test-removing-sensitive-data-clean/.git

Found 7 objects to protect
Found 3 tag-pointing refs : refs/tags/v1, refs/tags/v2, refs/tags/v3
Found 5 commit-pointing refs : HEAD, refs/heads/master, refs/remotes/origin/HEAD, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

* commit b8c88b09 (protected by 'HEAD')

Cleaning
--------

Found 11 commits
Cleaning commits:       100% (11/11)
Cleaning commits completed in 73 ms.

Updating 6 Refs
---------------

       Ref                                       Before     After   
       -------------------------------------------------------------
       refs/heads/master                       | b8c88b09 | 82104232
       refs/remotes/origin/lev/pr-to-stay-open | 2b131b17 | 0bcfb420
       refs/remotes/origin/master              | b8c88b09 | 82104232
       refs/tags/v1                            | c740754e | b8a33de1
       refs/tags/v2                            | 4abc08c8 | a0fdb11d
       refs/tags/v3                            | a448a05e | 4c4176a7

Updating references:    100% (6/6)
...Ref update completed in 18 ms.

Commit Tree-Dirt History
------------------------

       Earliest      Latest
       |                  |
       . D D D DD D D D m m

       D = dirty commits (file tree fixed)
       m = modified commits (commit message or parents changed)
       . = clean commits (no changes to file tree)

                               Before     After   
       -------------------------------------------
       First modified commit | 0cd750f6 | dedd68e8
       Last dirty commit     | 2b131b17 | 0bcfb420

Deleted files
-------------

       Filename                  Git id          
       ------------------------------------------
       fake_sensitive_data.txt | cc86c97f (199 B)


In total, 18 object ids were changed. Full details are logged here:

       git-test-removing-sensitive-data-clean.bfg-report/2020-01-24/09-22-19

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

Once the cleanup was done I force pushed things to Github using the command: git push origin --force --all && git push origin --force --tags

So these were the steps that I followed in order to wipe out the file fake_sensitive_data.txt from my repo. Now, the problems I'm facing:

The file is still present in ACTIVE branches.
The file is still present in COMMITS from branches that were deleted and never merged.
The file is still present in PRs that were already merged to master.

So my question is, how do I remove a file and the history from ALL branches, commits, PRs, tags, (anything) and push it to Github?

score 5 · Answer 1 · answered Jan 24 '20 at 17:42

TL;DR

You must get GitHub to do what you need. Even then, if the commits have been copied out to other repositories elsewhere, you must then get all those other copies (and the people who own them) to update their copies too.

Long

Nothing—no power on earth—can actually remove the file from the commits that contain the file. Nothing can change any existing commit, ever. Once the commit is made, it is in effect set in stone, or frozen for all time.

What the BFG and git filter-branch do instead is to make new and improved commits, by copying the commits that do have the file, to new ones that don't. (The fact that the new commits don't have the file is the improvement, in this case.)
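As a minimal sketch of this copy-and-improve idea, the script below uses git commit --amend as a small stand-in for what The BFG does at scale (an assumption for illustration; it is not how The BFG is implemented). The throwaway repository, the demo identity, and the README file are all hypothetical. The rewritten commit gets a different hash ID, and the original commit still contains the file.

```shell
set -e
repo=$(mktemp -d)                  # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name demo          # hypothetical identity, local to this repo
git config user.email demo@example.com

echo "readme" > README
echo "secret" > fake_sensitive_data.txt
git add .
git commit -q -m "initial"
old=$(git rev-parse HEAD)          # hash ID of the original commit

# "Copy" the commit to a new one that lacks the file.
git rm -q fake_sensitive_data.txt
git commit -q --amend -m "initial"
new=$(git rev-parse HEAD)          # hash ID of the replacement commit

echo "old commit: $old"
echo "new commit: $new"
old_files=$(git ls-tree --name-only "$old")   # the old commit is unchanged
new_files=$(git ls-tree --name-only "$new")   # the new commit has no secret file
echo "files in old commit: $old_files"
echo "files in new commit: $new_files"
```

The old commit is not edited in place; it is simply no longer the one that the branch name points to.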

So far, this is pretty simple. The old commits are still there, and now the new ones are there too. But you want the old ones gone. This is where everything goes awry. This is also where things get a little complicated.

The question you should ask at this point is:

How does Git find a commit in the first place?
For that matter, how does anyone find a commit? What is the true name of a commit?

You have four links above, and one of them is https://github.com/luivilella/git-test-removing-sensitive-data/tree/124e5707bf29a24cfb4167c869250fd919c42446. I'm leaving the full URL to be shown here. Note the very long string of random-looking hexadecimal digits at the end, 124e5707bf29a24cfb4167c869250fd919c42446. This is the commit's hash ID.

This is the true name of the commit. This is how someone who has the commit can find it, reliably, every time. You just have to memorize 124eblahblah (hard) or write it down somewhere and cut-and-paste it (easy) and run git checkout hash-id and you have it out and ready to work with.
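Here is a small sketch of using a hash ID as a commit's name, run in a throwaway repository with a hypothetical demo identity. It only uses git rev-parse, git cat-file and git checkout; the hash printed will differ on every run.

```shell
set -e
repo=$(mktemp -d)                  # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name demo          # hypothetical identity, local to this repo
git config user.email demo@example.com
git commit -q --allow-empty -m "first"

hash=$(git rev-parse HEAD)         # the commit's full hash ID
echo "hash ID: $hash"

type=$(git cat-file -t "$hash")    # ask Git what kind of object this name refers to
echo "object type: $type"

git checkout -q "$hash"            # check out by hash ID: a detached HEAD
checked_out=$(git rev-parse HEAD)
```

Anyone who has the commit and knows the hash ID can do the same, with or without a branch name.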

Now, every repository—including every clone of some original repository—has, in it, every commit it's ever picked up, minus any it's thrown out. Note that The BFG ended its session with advice to run:

git reflog expire --expire=now --all && git gc --prune=now --aggressive

The git gc command is Git's ~~Grim Reaper~~ Garbage Collector. It's the housekeeping program, or more precisely, a director of individual housekeeping programs that go around looking for commits and other Git objects that nobody could find. If you can't find the object—if its hash ID isn't written down anywhere that Git can see—well, then, you obviously don't care if the Grim Collector erases it entirely.
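To see the kind of object the collector is looking for, here is a sketch in a throwaway repository (the branch name side and the demo identity are hypothetical). It makes a commit on a branch, deletes the branch, and then asks git fsck which objects are unreachable when reflogs are ignored.

```shell
set -e
repo=$(mktemp -d)                  # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name demo          # hypothetical identity, local to this repo
git config user.email demo@example.com
git commit -q --allow-empty -m "kept"

git checkout -q -b side            # hypothetical branch name
git commit -q --allow-empty -m "abandoned"
lost=$(git rev-parse HEAD)         # remember the hash ID before losing the name

git checkout -q -                  # back to the original branch
git branch -q -D side              # now no branch name finds $lost

# Ignoring reflogs, list commits that no ref can reach.
unreachable=$(git fsck --unreachable --no-reflogs 2>/dev/null | grep "^unreachable commit" || true)
echo "$unreachable"
```

Without --no-reflogs the commit would still count as reachable, because the HEAD reflog remembers it.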

So now we have to ask:

Where can these hash IDs be written down so that Git can see them?

For Git itself, the answer is mainly: In other commits. Every commit can list some other commits' hash IDs. If the commit with hash H lists commit hash ID G, then anyone who can find H, can use that to find G. If the commit whose hash ID is G lists the hash ID of commit F, then anyone with either H or G can find F.

If you want to draw it, draw a commit with some number of arrows coming out of it. These are the parent commit hash IDs in the commit. Most commits have exactly one. Merge commits have two, for their two parents.¹ These arrows always point backwards, to some previous commit. So if you can just find the last commit(s), you can find every commit.
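The backwards-pointing arrows can be inspected directly. This sketch builds three commits named F, G and H in a throwaway repository (names and identity are hypothetical) and prints each commit's parent hash ID.

```shell
set -e
repo=$(mktemp -d)                  # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name demo          # hypothetical identity, local to this repo
git config user.email demo@example.com
git commit -q --allow-empty -m "F"
git commit -q --allow-empty -m "G"
git commit -q --allow-empty -m "H"

# One line per commit: subject, abbreviated hash, abbreviated parent hash(es).
git log --format='%s  %h  parent: %p'

# The raw commit object stores the parent's full hash ID.
parent_of_h=$(git cat-file -p HEAD | sed -n 's/^parent //p')
g=$(git rev-parse HEAD~1)
count=$(git rev-list --count HEAD) # commits reachable from the last one
```

Starting from H alone, Git reaches G and then F, which is why finding the last commit is enough.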

This is where things like branch names (master), tag names (v1.2), and remote-tracking names (origin/master) come in. Git gives you these naming devices to find one specific commit.² With a branch name, that's the last commit that we should say is part of the branch. With any other name, that's just some hash ID, e.g., a tag can tag a particular commit as "use this commit to access version 1.2".

These names are collectively refs or references. When The BFG said:

Updating 6 Refs

that's what it was talking about. The BFG copied some particular original commits to new and improved ones. Then, having copied those, it had to copy all subsequent commits as well, either improving them too (because they had the file you want gone) or just because the old ones held the hash ID of some other old-and-bad commit that has now been improved.

Once The BFG has copied-and-improved everything that has to be improved, and copied everything else that has to be copied because of the copy-and-improvement, The BFG goes in and changes each ref appropriately.

But The BFG can only change the refs in your repository. Every Git repository that exists has its own refs. All Gits share commits (by copying) but they don't necessarily share all their refs.
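To list the refs that one particular repository holds, git for-each-ref is enough. The sketch below uses a throwaway repository with a hypothetical tag v1; running the same command in another clone may show a different set of names.

```shell
set -e
repo=$(mktemp -d)                  # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name demo          # hypothetical identity, local to this repo
git config user.email demo@example.com
git commit -q --allow-empty -m "first"
git tag v1                         # lightweight tag naming the first commit
git commit -q --allow-empty -m "second"

# Every ref in this repository, with the hash ID it currently holds.
git for-each-ref --format='%(refname) %(objectname)'

tagged=$(git rev-parse refs/tags/v1)
first=$(git rev-parse HEAD~1)
tag_refs=$(git for-each-ref --format='%(refname)' refs/tags)
```

The branch name has moved on to the second commit, while the tag still names the first one.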

Having updated your own repository's refs, The BFG now recommends that you purge your Git's reflogs, which hold logs of what the ref hash IDs were (and of course Git can see all of those, so those keep the old commits live). That's the git reflog expire command. The --expire=now part says don't keep entries for 30 or 90 days: remove them all now. Then, The BFG recommends that you run the housekeeping git gc program. The --prune=now removes the standard 14-day grace period that Git uses so that background git gc operations won't remove an object that some other Git command is in the middle of making.³

So, after this step, your repository no longer has the "bad" commits. If you try to git checkout hash, your Git will say: I don't seem to have that hash ID in my object database. It's gone! Everything is fine.
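A sketch of that purge in a throwaway repository follows (the branch name side and the demo identity are hypothetical). It abandons a commit, confirms the object is still present, runs the two commands The BFG recommends (without --aggressive, which only affects packing), and checks again.

```shell
set -e
repo=$(mktemp -d)                  # throwaway repository for the demo
cd "$repo"
git init -q
git config user.name demo          # hypothetical identity, local to this repo
git config user.email demo@example.com
git commit -q --allow-empty -m "kept"

git checkout -q -b side            # hypothetical branch name
git commit -q --allow-empty -m "old commit"
old=$(git rev-parse HEAD)
git checkout -q -
git branch -q -D side              # only the HEAD reflog remembers $old now

if git cat-file -e "$old" 2>/dev/null; then before=present; else before=missing; fi

git reflog expire --expire=now --all   # drop every reflog entry immediately
git gc -q --prune=now                  # prune unreachable objects, no grace period

if git cat-file -e "$old" 2>/dev/null; then after=present; else after=missing; fi
echo "before: $before, after: $after"
```

This only affects the repository in which the commands run.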

But that's your Git repository. So now you use git push origin --force: this has your Git call up another Git—the one over at GitHub—and give them any new objects (commits and internal objects) they'll need, such as the new and improved ones that The BFG made. Then your Git sends forceful commands: For branch name master, set that branch name to remember commit X! For tag name v1.2, set that tag name to remember commit Q! and so on.
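The effect of --force can be tried locally. In this sketch a bare repository named server.git stands in for the one at GitHub (an assumption for illustration; GitHub adds its own permission checks), and git commit --amend stands in for the rewrite. A plain push of the rewritten history is refused, and the forced one moves the server's branch name.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q --bare server.git              # stand-in for the GitHub-side repository
git clone -q server.git local 2>/dev/null  # the empty-repository warning is expected
cd local
git config user.name demo                  # hypothetical identity, local to this repo
git config user.email demo@example.com

echo "readme" > README
echo "secret" > fake_sensitive_data.txt
git add .
git commit -q -m "initial"
git push -q origin HEAD
branch=$(git symbolic-ref --short HEAD)

# Rewrite history: a new commit without the file replaces the old one.
git rm -q fake_sensitive_data.txt
git commit -q --amend -m "initial"
new=$(git rev-parse HEAD)

# Without --force the server refuses the non-fast-forward update.
if git push -q origin HEAD 2>/dev/null; then plain=accepted; else plain=rejected; fi
echo "plain push: $plain"

git push -q --force origin HEAD            # forceful command: set the name to $new
server=$(git -C ../server.git rev-parse "refs/heads/$branch")
echo "server branch now at: $server"
```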

If they obey (which they will if you have the right permissions), now GitHub's Git can only find those commits through those names. Those commits can find earlier commits, and so on. But GitHub's Git hasn't removed the other commits. They'll do that when their Git gets around to running git gc, whenever that is. Moreover, they may have ref names that they never told you about.

The ones you have mentioned here are pull requests. GitHub implements pull requests by setting special GitHub-only names, refs/pull/*. They copy these names into other GitHub-side repositories when appropriate, according to all the rules that make GitHub work. But they don't let you set them or delete them. See also Delete a closed pull request from GitHub.
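The following sketch simulates this locally. A bare repository named server.git stands in for GitHub, and refs/pull/1/head is created by hand with git update-ref to mimic the name GitHub would create for a pull request (both are assumptions for illustration; on the real GitHub you cannot write to these refs). After a rewrite and the same force pushes used in the question, the pull ref still holds the old commit.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q --bare server.git              # stand-in for the GitHub-side repository
git clone -q server.git local 2>/dev/null  # the empty-repository warning is expected
cd local
git config user.name demo                  # hypothetical identity, local to this repo
git config user.email demo@example.com

git commit -q --allow-empty -m "old commit"
old=$(git rev-parse HEAD)
git push -q origin HEAD

# Assumption: mimic the server-side ref GitHub would create for PR #1.
git -C ../server.git update-ref refs/pull/1/head "$old"

# Rewrite history locally, then force push all branches and tags.
git commit -q --amend --allow-empty -m "rewritten commit"
git push -q --force --all origin
git push -q --force --tags origin

# The server still has the pull ref, and it still names the old commit.
remote_pull=$(git ls-remote origin 'refs/pull/*')
echo "$remote_pull"

# An ordinary fetch does not copy refs/pull/* into the local repository.
git fetch -q origin
local_pull=$(git for-each-ref refs/pull)
```

On a real repository, git ls-remote origin 'refs/pull/*' shows which pull-request refs exist, and comparing their hash IDs with the old ones from the BFG report would show which of them still hold pre-rewrite commits.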

So: you must contact GitHub support and get them to delete any PRs that keep the "bad" commits alive. You must have them force their Git to run the appropriate git gc to discard commits before the default maintenance window passes, too. Only then will the URLs that refer to these PRs, or to commits by hash ID, stop working. And of course, you must remember that anyone who can clone or access your GitHub repository may have copied those commits to their own repository by now, and may have your data: and the only way to get them to give it up is to go to them, whoever they are.

¹Some merge commits, which Git calls octopus merges, can have more than two parents. The arrows all still necessarily point backwards.

²Tag names can point directly to other Git internal objects, such as trees or blobs. Trees are how Git stores the names of files that go with a commit, and blobs are how Git stores the files' content—the data for each file. A tag name can also point to the last of Git's internal object types, which is the annotated tag object. The annotated tag object contains the hash ID of some previously-existing object, plus of course the annotations.

³When Git is building new commits or other data, the way it does so is greatly simplified by this grace period. Git can just create objects left and right, getting new hash IDs that only that one program has at the moment: none are saved anywhere and none of those objects can be found. Then, at the end, when everything is ready, the object-creator writes the most important hash ID—the one for the last commit in the branch, for instance—into some ref. Now the objects are all findable, and the process is complete.

Should something go wrong (Git discovers that some commit can't be made for some reason, for instance), the object-creation program can simply exit immediately. Any objects it made that are unused will sit around for the grace period, and then the next run of git gc, whenever that is (Git runs it automatically for you so that you don't need to think about it), will find and remove the leftover junk.
