
Background

I'm currently trying to purge some sensitive secret files from a private repository in GitHub.

I found the Stack Overflow post "How to remove file from Git history?", which tells me to run the following command:

git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch path_to_file" HEAD

The problem is that I need a specific path_to_file. For me, this is difficult due to the age of the repository. The files that need to be deleted aren't just recently committed secrets - but files that have existed for months if not years that we only just realized are a huge liability to have in the repository itself.

Because these sensitive files have existed in the repository for so long, they may have been moved to different folders and/or renamed.

Question

Given the current path of a file in a Git repository, such as /Documents/GitHub/CoolApp/secret.pem, how do I get all of its historical paths, including those that resulted from moving or renaming it?

AlanSTACK
  • Have you already looked into [BFG](https://rtyley.github.io/bfg-repo-cleaner/)? – Oliver Aug 24 '21 at 06:08
  • @Oliver Doesn't BFG suffer from the same limitations as described in VonC's answer? – AlanSTACK Aug 24 '21 at 14:23
  • @AlanSTACK Yes: both `filter-branch` and BFG are obsolete, and to be replaced with `filter-repo` ([Git 2.24+, Q4 2019](https://stackoverflow.com/a/58251653/6309)) – VonC Aug 24 '21 at 14:37
  • Is there anything missing to the answer below? – VonC Aug 25 '21 at 15:16

1 Answer


This is not well supported out of the box, and the same problem has been reported against the more modern git filter-repo tool (which aims at replacing git filter-branch).

Example (issue 265):

Say I have this in git history (R denotes renaming/moving):

┌ 9000ef (+-) newName.txt
├ cd5678 (R)  oldName.txt => newName.txt
├ 1234ab (+)  oldName.txt

Currently, if I run git-filter-repo --path newName.txt, the new repo is going to contain only:

┌ replace/9000ef (+-) newName.txt
├ replace/cd5678 (+)  newName.txt

And the only way to make it contain the history of oldName.txt as well (that I know of) would be to run git-filter-repo --path newName.txt --path oldName.txt.
Now, there are a few problems with this, like:

  • There are hundreds of commits in between, and I may have hundreds of files. I may also not know whether any of them were renamed or moved in the past.
  • I can only search manually for each file to see if it happened in the past. This is repetitive and tiresome, not to mention that a file may have been moved multiple times.
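For a single file, that manual search can be partly automated with git log --follow. The following is a minimal sketch, not a complete solution: the function name historical_paths is hypothetical, and the result depends on Git's rename detection, which is a similarity heuristic and can miss a rename that was combined with a heavy rewrite in the same commit. It also does not detect copies.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git or filter-repo).
# Prints each distinct path a file has had, newest name first.
historical_paths() {
    # --follow     : keep walking history across renames (single path only)
    # --name-only  : print the path touched by each commit
    # --format=    : suppress commit headers so only paths remain
    # awk          : drop blank lines and repeated paths, keeping first-seen order
    git log --follow --name-only --format= -- "$1" | awk 'NF && !seen[$0]++'
}

# Usage, from the top of the working tree:
#   historical_paths secret.pem
```

Applied to the three-commit history shown above, historical_paths newName.txt should list newName.txt and then oldName.txt. Because --follow accepts only one path at a time, the function has to be run once per sensitive file.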

The new tool (filter-repo) includes an --analyze option, which can be of interest here.
As described in issue 25:

But even if --follow implemented following of renames for multiple files or a directory or more, that still wouldn't necessarily be sufficient because perhaps the user needs copy detection (i.e. it wasn't a file renamed from somewhere else, rather it was copied).
But with copy detection it's not as clear if you want the full history of the original; I can imagine that in some cases you would but not others.

And if we start doing either rename or copy detection, then we're moving from well-defined correct behavior to heuristics.

All that said, I wanted something like that when I was using it too.
The best compromise I came up with was to have people run 'git filter-repo --analyze' beforehand, look at the renames sub-report, and pick out additional paths by hand based on that to feed to their filter-repo run.

The --analyze option still had a few caveats with the rename detection, but that was mostly fundamental to the problem.
Providing it and letting the user decide what to include (though I didn't even bother with copy detection), seemed like the best option I had available.
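The workflow described in that quote could be sketched as follows. This is an illustration under assumptions, not the maintainer's exact procedure: the function names are hypothetical, the report location .git/filter-repo/analysis/renames.txt is where filter-repo writes it to my knowledge (check the directory on your version), and the use of --invert-paths with --paths-from-file to delete the listed paths, rather than keep them, is my addition for the purging use case in the question.

```shell
#!/bin/sh
# Hypothetical helpers wrapping git filter-repo; run them in a fresh clone.

# Step 1: generate the analysis reports and print the renames sub-report.
# The report path is an assumption about where filter-repo stores it.
show_renames_report() {
    git filter-repo --analyze &&
        cat .git/filter-repo/analysis/renames.txt
}

# Step 2: after picking the paths by hand, list them one per line in a
# file and remove all of them from every commit.
# --invert-paths turns the path list from "keep only" into "remove".
purge_paths() {
    git filter-repo --invert-paths --paths-from-file "$1"
}

# Usage:
#   show_renames_report
#   printf '%s\n' newName.txt oldName.txt > paths-to-purge.txt
#   purge_paths paths-to-purge.txt
```

Rewriting history does not by itself revoke a leaked secret, so the credentials in those files would still need to be rotated.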

VonC