If you need to "anonymize" a Git repository for all users, not just one, Git 2.2 (November 2014) added a useful feature to git fast-export:
See commit a872275 and commit 75d3d65 by Jeff King (peff):
teach fast-export an --anonymize option:
Sometimes users want to report a bug they experience on their repository, but they are not at liberty to share the contents of the repository.
It would be useful if they could produce a repository that has a similar shape to its history and tree, but without leaking any information.
This "anonymized" repository could then be shared with developers (assuming it still replicates the original problem).
This patch implements an "--anonymize" option to fast-export, which generates a stream that can recreate such a repository.
Producing a single stream makes it easy for the caller to verify that they are not leaking any useful information. You can get an overview of what will be shared by running a command like:
    git fast-export --anonymize --all |
    perl -pe 's/\d+/X/g' |
    sort -u |
    less
which will show every unique line we generate, modulo any numbers (each anonymized token is assigned a number, like "User 0", and we replace it consistently in the output).
In addition to anonymizing, this produces test cases that are relatively small (compared to the original repository) and fast to generate (compared to using filter-branch, or modifying the output of fast-export yourself).
Doc:
If the --anonymize option is given, git will attempt to remove all identifying information from the repository while still retaining enough of the original tree and history patterns to reproduce some bugs.
With this option, git will replace all refnames, paths, blob contents, commit and tag messages, names, and email addresses in the output with anonymized data.
Two instances of the same string will be replaced equivalently (e.g., two commits with the same author will have the same anonymized author in the output, but bear no resemblance to the original author string).
The relationship between commits, branches, and tags is retained, as well as the commit timestamps (but the commit messages and refnames bear no resemblance to the originals).
The relative makeup of the tree is retained (e.g., if you have a root tree with 10 files and 3 trees, so will the output), but their names and the contents of the files will be replaced.
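To see the documented behaviour end to end, here is a minimal sketch that exports an anonymized stream and imports it into a fresh repository with git fast-import. The throwaway repository, the "Alice Example" identity, and the file names are assumptions made up for the demo; only the fast-export and fast-import invocations come from Git itself.

```shell
#!/bin/sh
set -e

# Throwaway demo repository (all names and contents here are hypothetical).
work=$(mktemp -d)
git init -q "$work/orig"
git -C "$work/orig" config user.name "Alice Example"
git -C "$work/orig" config user.email "alice@example.com"
mkdir "$work/orig/subdir"
echo "top secret" >"$work/orig/subdir/secret.c"
git -C "$work/orig" add .
git -C "$work/orig" commit -q -m "add secret code"
echo "more" >>"$work/orig/subdir/secret.c"
git -C "$work/orig" commit -q -am "extend secret code"

# Export an anonymized stream and replay it into an empty repository.
git init -q "$work/anon"
git -C "$work/orig" fast-export --anonymize --all |
git -C "$work/anon" fast-import --quiet

# The history keeps its shape: both repositories hold the same number of commits.
orig_count=$(git -C "$work/orig" rev-list --count --all)
anon_count=$(git -C "$work/anon" rev-list --count --all)
echo "commits: original=$orig_count anonymized=$anon_count"

# Authors and subjects in the copy are placeholders, not the original strings.
git -C "$work/anon" log --all --format='%an <%ae> %s'
```

Because refnames are anonymized too, the imported repository has no branch matching its HEAD, which is why the commands above use --all rather than a branch name.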
With Git 2.28 (Q3 2020), "git fast-export --anonymize" learned to take a customized mapping, so that users can make its output more usable for debugging.
See commit f39ad38, commit 8a49495, commit 65b5d9f (25 Jun 2020), and commit d5bf91f, commit 6416a86, commit 55b0145, commit a0f6564, commit 7f40759, commit 750bb32, commit b897bf5, commit b8c0689 (23 Jun 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 0a23331, 06 Jul 2020)
fast-export: allow seeding the anonymized mapping
Helped-by: Eric Sunshine
Signed-off-by: Jeff King
After you anonymize a repository, it can be hard to find which commits correspond between the original and the result, and thus hard to reproduce commands that triggered bugs in the original.
Let's make it possible to seed the anonymization map.
This lets users either:
- mark names to be retained as-is, if they don't consider them secret (in which case their original commands would just work)
- map names to new values, which lets them adapt the reproduction recipe to the new names without revealing the originals
The implementation is fairly straight-forward.
We already store each anonymized token in a hashmap (so that the same token appearing twice is converted to the same result). We can just introduce a new "seed" hashmap which is consulted first.
This does make a few more promises to the user about how we'll anonymize things (e.g., token-splitting pathnames). But it's unlikely that we'd want to change those rules, even if the actual anonymization of a single token changes. And it makes things much easier for the user, who can unblind only a directory name without having to specify each path within it.
One alternative to this approach would be to anonymize as we see fit, and then dump the whole refname and pathname mappings to a file. This does work, but it's a bit awkward to use (you have to manually dig the items you care about out of the mapping).
git fast-export now has:
--anonymize-map=<from>[:<to>]:
Convert token <from> to <to> in the anonymized output.
If <to> is omitted, map <from> to itself (i.e., do not anonymize it).
Reproducing some bugs may require referencing particular commits or
paths, which becomes challenging after refnames and paths have been
anonymized.
You can ask for a particular token to be left as-is or
mapped to a new value.
For example, if you have a bug which reproduces with git rev-list sensitive -- secret.c, you can run:
---------------------------------------------------
$ git fast-export --anonymize --all \
--anonymize-map=sensitive:foo \
--anonymize-map=secret.c:bar.c \
>stream
---------------------------------------------------
After importing the stream, you can then run git rev-list foo -- bar.c in the anonymized repository.
Note that paths and refnames are split into tokens at slash boundaries.
The command above would anonymize subdir/secret.c as something like path123/bar.c; you could then search for bar.c in the anonymized repository to determine the final pathname.
To make referencing the final pathname simpler, you can map each path component; so if you also anonymize subdir to publicdir, then the final pathname would be publicdir/bar.c.
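The per-component mapping described above can be tried on a scratch repository. This is a sketch under assumptions: the repository layout and identity are invented for the demo, and it needs Git 2.28 or later for --anonymize-map. Per the documentation quoted above, the only path in the imported repository should come out as publicdir/bar.c.

```shell
#!/bin/sh
set -e

# Scratch repository containing subdir/secret.c (hypothetical demo content).
work=$(mktemp -d)
git init -q "$work/orig"
git -C "$work/orig" config user.name "Alice Example"
git -C "$work/orig" config user.email "alice@example.com"
mkdir "$work/orig/subdir"
echo "int main(void) { return 0; }" >"$work/orig/subdir/secret.c"
git -C "$work/orig" add .
git -C "$work/orig" commit -q -m "add secret.c"

# Seed the anonymization map for each path component separately,
# since paths are split into tokens at slash boundaries.
git init -q "$work/anon"
git -C "$work/orig" fast-export --anonymize --all \
    --anonymize-map=subdir:publicdir \
    --anonymize-map=secret.c:bar.c |
git -C "$work/anon" fast-import --quiet

# Refnames are anonymized, so locate the commit through --all
# instead of relying on a branch name.
tip=$(git -C "$work/anon" rev-list --all -n 1)
paths=$(git -C "$work/anon" ls-tree -r --name-only "$tip")
echo "$paths"
```

Dropping the subdir mapping would leave only the last component readable, with the directory replaced by a generated token.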
Before Git 2.34 (Q4 2021), the output from "git fast-export", when its anonymization feature is in use, showed an annotated tag incorrectly.
See commit 2f040a9 (31 Aug 2021) by Tal Kelrich (hasturkun).
(Merged by Junio C Hamano -- gitster -- in commit febba80, 10 Sep 2021)
fast-export: fix anonymized tag using original length
Signed-off-by: Tal Kelrich
Commit 7f40759 ("fast-export: tighten anonymize_mem() interface to handle only strings", 2020-06-23, Git v2.28.0-rc0 -- merge listed in batch #7) changed the interface used in anonymizing strings, but failed to update the size of annotated tag messages to match the new anonymized string.
As a result, exporting tags having messages longer than 13 characters would create output that couldn't be parsed by fast-import, as the data length indicated was larger than the data output.
Reset the message size when anonymizing, and add a tag with a "long" message to the test.