
The resources below describe how to remove sensitive data from a git repository.

Afterward, how do I double-check that the naughty bits are really gone, i.e., search all blobs in the repository (be they referenced, garbage, packed, loose, or otherwise) to verify that the offending pattern has been utterly destroyed?

Does the answer change when working with a bare repository versus one with a work tree?

Greg Bacon

2 Answers


According to that GitHub page, any commit may be referenced via its SHA1, even if no ref points to it, so you must delete the repository and recreate it. I can verify that a commit is still visible at least two weeks after it has been dereferenced. In general, once you have removed the sensitive data, so that it is no longer accessible via any ref, the simplest way to prune Git’s object store is to clone the repository and destroy the old one. This is especially true if you do not have direct access to the repository, such as on GitHub.

(In other words: If the garbage SHA1 is known, then GitHub will happily serve it over the web. The Git protocol will normally refuse to give you unnamed commits, but it can be enabled with the daemon.uploadarch config.)

The way to turn referenced objects into garbage objects is with judicious application of rebase, filter-branch, reflog, update-ref and the like. The way to purge garbage objects is with judicious application of gc, fsck, prune, and repack.
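
As a rough sketch of the purge step, assuming the sensitive commits are already unreachable from every ref, something like the following expires the reflogs and prunes immediately. The function name `purge_garbage` is made up for illustration; it is not a Git command.

```shell
# Hypothetical helper: drop reflog entries, then delete unreachable objects.
# Run it inside the repository. It is destructive and cannot be undone.
purge_garbage() {
  # Expire every reflog entry so old commits stop being reachable via reflogs.
  git reflog expire --expire=now --all &&
  # Repack and remove unreachable objects now, skipping the usual grace period.
  git gc --prune=now --quiet
}
```

Afterward, `git cat-file -e <sha>` on a purged object ID should fail, which is a quick spot check when the offending SHA1 is known.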

Example queries:

  • List dangling commits and grep each one for sensitive data that has not yet been garbage collected:

    git fsck --no-reflogs | awk '/dangling commit/{print $3}' | while read -r sha1;
      do git grep foo "$sha1"; done
    
  • List every single object reachable from a ref (add --walk-reflogs for reflogs instead):

    git rev-list --objects --all | while read -r sha path;
      do git show "$sha" | grep baz; done
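
The two queries above cover dangling commits and ref-reachable objects separately. As a sketch of a single pass over everything in the object store (referenced or garbage, packed or loose), the loop below relies on `git cat-file --batch-all-objects`, which as far as I know needs Git 2.6 or later. The function name `scan_all_blobs` and its pattern argument are illustrative only.

```shell
# Hypothetical helper: print the ID of every blob whose content matches PATTERN.
# Usage: scan_all_blobs PATTERN   (run inside the repository)
scan_all_blobs() {
  pattern=$1
  # Enumerate every object in the object store, regardless of reachability.
  git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' |
  while read -r sha type; do
    # Only blobs hold file contents; commit messages and tags are not checked here.
    [ "$type" = blob ] || continue
    if git cat-file blob "$sha" | grep -q -e "$pattern"; then
      echo "$sha"
    fi
  done
}
```

Empty output from `scan_all_blobs foo` suggests no blob still contains the pattern. Since it only asks Git for objects by ID, it should behave the same in a bare repository.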
    

Another way is to use fast-export to export the entire repository into a text-based file, which you can pick through and manipulate with any tool you want, then fast-import into a fresh repo. This is good because it doesn’t carry any garbage, and you can grep the whole archive very easily.
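
A minimal sketch of that round trip, assuming the old repository has already been cleaned so that no ref reaches the sensitive data. The function name `rebuild_repo` and its three arguments are hypothetical.

```shell
# Hypothetical helper: rebuild a repository from a fast-export stream.
# Usage: rebuild_repo OLD_REPO NEW_REPO DUMP_FILE
rebuild_repo() {
  old=$1 new=$2 dump=$3
  # Serialize everything reachable from any ref; garbage objects are not exported.
  git -C "$old" fast-export --all > "$dump" || return 1
  # The dump can be inspected or edited here, e.g. grep -a PATTERN "$dump"
  git init -q "$new" &&
  git -C "$new" fast-import --quiet < "$dump"
}
```

If the new repository is not bare, nothing is checked out after the import; running `git reset --hard` there should populate the work tree.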

The answer does not change if you do not have a work tree, but commands like filter-branch may want a work tree for some use cases.

Josh Lee
  • I eliminated the offending bits by construction by way of `fast-export`, hack, `fast-import` into a new repo, and `rm -rf` the old. – Greg Bacon Feb 25 '12 at 02:38
git log -Sword

Where word is the string you are checking for.
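
By default `git log` starts from HEAD only, so a sketch that covers every ref might add `--all`. Note that this walks reachable history only, so it will not see garbage objects. The function name `find_in_history` is made up for illustration.

```shell
# Hypothetical helper: list commits on any ref whose changes add or remove PATTERN.
# Usage: find_in_history PATTERN   (run inside the repository)
find_in_history() {
  # -S finds commits where the number of occurrences of the string changes.
  git log --all --oneline -S"$1"
}
```

Empty output from `find_in_history word` means no commit reachable from a ref introduces or removes the string.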

See also: How to grep Git commit diffs or contents for a certain word?

Chris Cherry