I'm trying to remove sensitive data like passwords from my Git history. Instead of deleting whole files I just want to substitute the passwords with removedSensitiveInfo. This is what I came up with after browsing through numerous StackOverflow topics and other sites.

git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"

When I run this command it seems to be rewriting the history (it shows the commits it's rewriting and takes a few minutes). However, when I check to see if all sensitive data has indeed been removed it turns out it's still there.

For reference this is how I do the check

git grep aSecretPassword1 $(git rev-list --all)

Which shows me all the hundreds of commits that match the search query. Nothing has been substituted.

Any idea what's going on here?

I double-checked the regular expression I'm using, which seems to be correct. I'm not sure what else to check for or how to properly debug this, as my Git knowledge is quite rudimentary. For example, I don't know how to test whether 1) my regular expression isn't matching anything, 2) sed isn't being run on all files, 3) the file changes are not being saved, or 4) something else.

Any help is very much appreciated.

P.S. I'm aware of several StackOverflow threads about this topic. However, I couldn't find one that is about substituting words (rather than deleting files) in all (ASCII) files (rather than specifying a specific file or file type). Not sure whether that should make a difference, but none of the suggested solutions have worked for me.

Marc
  • Can you not simply remove all references to your password from the codebase and change the passwords? In other words, you should first address the issue of having such passwords in your codebase to begin with. If you get them out of the codebase and then change the passwords, what do you care if the old passwords still exist in the history? – Mike Brant Nov 09 '13 at 00:44
  • I could in theory, but it's not practical in my situation. – Marc Nov 09 '13 at 02:02

2 Answers

git-filter-branch is a powerful but difficult-to-use tool: there are several obscure things you need to know to use it correctly for your task, and each one is a possible cause of the problems you're seeing. So rather than immediately trying to debug them, let's take a step back and look at the original problem:

  • Substitute given strings (i.e. passwords) within all text files (without specifying a specific file/file-type)
  • Ensure that the updated Git history does not contain the old password text
  • Do the above as simply as possible

There is a tailor-made solution to this problem:

Use The BFG... not git-filter-branch

The BFG Repo-Cleaner is a simpler alternative to git-filter-branch specifically designed for removing passwords and other unwanted data from Git repository history.

Ways in which the BFG helps you in this situation:

  • The BFG is 10-720x faster
  • It automatically runs on all tags and references, unlike git-filter-branch - which only does that if you add the extraordinary --tag-name-filter cat -- --all command-line option (Note that the example command you gave in the Question DOES NOT have this, a possible cause of your problems)
  • The BFG doesn't generate any refs/original/ refs - so no need for you to perform an extra step to remove them
  • You can express your passwords as simple literal strings, without having to worry about getting regex-escaping right. The BFG can handle regex too, if you really need it.

Using the BFG

Carefully follow the usage steps - the core bit is just this command:

$ java -jar bfg.jar  --replace-text replacements.txt  my-repo.git

The replacements.txt file should contain all the substitutions you want to do, in a format like this (one entry per line - note the comments shouldn't be included):

PASSWORD1 # Replace literal string 'PASSWORD1' with '***REMOVED***' (default)
PASSWORD2==>examplePass         # replace with 'examplePass' instead
PASSWORD3==>                    # replace with the empty string
regex:password=\w+==>password=  # Replace, using a regex

Your entire repository history will be scanned, and all text files (under 1MB in size) will have the substitutions performed: any matching string (that isn't in your latest commit) will be replaced.
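For the three passwords from the question, the replacements file could be produced as in the sketch below. It assumes the `==>` syntax described above and reuses the placeholder passwords and the `removedSensitiveInfo` text from the question; the commented-out follow-up commands are recalled from the BFG usage steps and should be checked against the BFG documentation before running them on a real repository.

```shell
# Write one substitution rule per line (no trailing comments in the real file).
# The password strings are the placeholders used in the question.
cat > replacements.txt <<'EOF'
aSecretPassword1==>removedSensitiveInfo
aSecretPassword2==>removedSensitiveInfo
aSecretPassword3==>removedSensitiveInfo
EOF

# Show what will be passed to the BFG.
cat replacements.txt

# Next steps, run against a fresh mirror clone named my-repo.git (not executed here):
#   java -jar bfg.jar --replace-text replacements.txt my-repo.git
#   cd my-repo.git
#   git reflog expire --expire=now --all && git gc --prune=now --aggressive
```

Because the BFG leaves the latest commit untouched, the passwords need to be removed from the current files in an ordinary commit before the history is rewritten.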

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Roberto Tyley
  • Nice - worked well... Also, I never saw an OSS project with a political message printed upon running it. – Joe J Jun 16 '17 at 03:06
  • Hey @Roberto, please help! After I complete the steps (the last being git push) to replace the passwords in my Git repo history (they are no longer in the current version, in either upstream or the fork, only in history) and create a PR, I see a lot of unrelated changes in the PR, where it's trying to update other files (which do not contain a password). Is this expected, or am I missing something? (I don't want to merge this PR with so many changes.) Shouldn't it be updating only the required files, where the password is replaced with ***REMOVED***? I followed this: https://rtyley.github.io/bfg-repo-cleaner/ – lowLatency Oct 18 '19 at 23:51
  • @Roberto - just want to add a few more details: the result looks good after `$ bfg --replace-text passwords.txt my-repo.git`, as it shows only those files which have a password. But after git push and creating the PR, it shows a whole lot of other files in the PR, including the Readme.md file. I'm trying to do a quick POC before I do it for all the passwords and on a couple of other repos. Looking forward to any help! – lowLatency Oct 18 '19 at 23:58

Looks OK. Remember that filter-branch retains the original commits under refs/original/, e.g.:

$ git commit -m 'add secret password, oops!'
[master edaf467] add secret password, oops!
 1 file changed, 4 insertions(+)
 create mode 100644 secret
$ git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"
Rewrite edaf467960ade97ea03162ec89f11cae7c256e3d (2/2)
Ref 'refs/heads/master' was rewritten

Then:

$ git grep aSecretPassword `git rev-list --all`
edaf467960ade97ea03162ec89f11cae7c256e3d:secret:aSecretPassword2

but:

$ git lola
* e530e69 (HEAD, master) add secret password, oops!
| * edaf467 (refs/original/refs/heads/master) add secret password, oops!
|/  
* 7624023 Initial

(git lola is my alias for git log --graph --oneline --decorate --all). Yes, it's in there, but under the refs/original namespace. Clear that out:

$ rm -rf .git/refs/original
$ git reflog expire --expire=now --all
$ git gc
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 0 (delta 0)

and then:

$ git grep aSecretPassword `git rev-list --all`
$ 

(as always, run filter-branch on a copy of the repo Just In Case; and then removing original refs, expiring the reflog "now", and gc'ing, means stuff is Really Gone).
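A slightly more defensive variant of the cleanup above is sketched here. It deletes the backup refs through Git itself, so it also covers refs that have been packed into `.git/packed-refs`, and it passes `--prune=now` because a plain `git gc` keeps recently unreachable objects for a grace period (two weeks by default). The function name `purge_original_refs` is made up for illustration; run it only inside a copy of the repository.

```shell
# Hypothetical helper: drop filter-branch's backup refs and prune the
# objects that only those refs kept alive.
purge_original_refs() {
  # Delete every ref under refs/original/, whether stored loose or packed.
  git for-each-ref --format='delete %(refname)' refs/original/ |
    git update-ref --stdin
  # Forget reflog entries that may still point at the pre-rewrite commits.
  git reflog expire --expire=now --all
  # Prune unreachable objects immediately instead of after the grace period.
  git gc --quiet --prune=now
}
```

After calling `purge_original_refs` in the rewritten copy, the `git grep aSecretPassword $(git rev-list --all)` check from the question can be repeated; tags or other refs that still point at old commits are not touched by this helper.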

torek
  • I followed your instructions exactly and the problem persists. Any suggestions on how to debug this in a way I can find out where it's going wrong? I'm starting to get the feeling the regex doesn't match properly, even though it's a very basic one. – Marc Nov 09 '13 at 01:55
  • You might also have tags or other references hanging on to the "pre-filtered" commits. Look for which commits `git grep` finds and see what references lead to them. You could also check out the problematic revisions (even before using filter-branch) and try a manual `find ...` to make sure the sed is doing what you wanted to the files in question. – torek Nov 09 '13 at 02:05
  • Yeah, you're probably right with regard to there being other tags/references. I thought the commands would filter them too, but apparently I was wrong. In the end I went with Roberto's BFG solution, which worked flawlessly. – Marc Nov 09 '13 at 20:43
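The manual check suggested in the comments (running the find/sed pair by hand, outside filter-branch) might look like the sketch below. It works on a scratch directory with made-up file contents rather than on a checked-out revision, and it uses `-i.bak` instead of the question's `-i ''` on the assumption that the attached-suffix form is accepted by both GNU and BSD sed.

```shell
# Scratch directory with one file containing a sample secret (hypothetical data).
workdir=$(mktemp -d)
printf 'user=admin\npass=aSecretPassword2\n' > "$workdir/config.txt"

# Same expression as in the question, applied to every regular file.
find "$workdir" -type f ! -name '*.bak' -exec sed -E -i.bak \
  -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;

# Remove the backup copies that -i.bak left behind.
find "$workdir" -type f -name '*.bak' -exec rm -f {} +

# Count the files that still contain one of the secrets.
remaining=$( { grep -rlE 'aSecretPassword[123]' "$workdir" || true; } | wc -l | tr -d ' ')
echo "files still containing a secret: $remaining"
```

If the count is zero here but `git grep` still reports matches after the rewrite, the regular expression and sed are probably not the culprit, and the remaining matches are more likely reachable through refs that the rewrite did not cover.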