2

I'm trying to rewrite history, using:

git filter-branch --tree-filter 'git ls-files -z "*.php" |xargs -0 perl -p -i -e "s#(PASSWORD1|PASSWORD2|PASSWORD3)#xXxXxXxXxXx#g"' -- --all

as described in this tutorial.

However, my password strings contain all kinds of non-alphanumeric characters, e.g. $, ' and \, rather than being simple strings like 'PASSWORD1' in the example above.

Can someone explain what escaping I need? I've not been able to find this anywhere, and I've been battling with this for hours.

Roberto Tyley
fooquency
  • This isn't an answer to the question as it stands. But if the passwords don't ever change from their first introduction to the repository, it would surely be easier to anonymise them (by script or by hand), commit that, and then rebase the commit to rewrite history. – James Cranch Sep 05 '13 at 23:25
  • @fooquency Please try my script and tell me what errors you might see. – konsolebox Sep 05 '13 at 23:44

4 Answers

3

try the BFG instead of git filter-branch...

You can use a much more friendly substitution format if you use The BFG rather than git-filter-branch. Create a passwords.txt file, with one password per line like this:

PASSWORD1==>xXxXx      # Replace literal string 'PASSWORD1' with 'xXxXx'
ezxcdf\fr$sdd%==>xXxXx # ...all text is matched as a *literal* string by default

Then run the BFG with this command:

$ java -jar bfg.jar -fi '*.php' --replace-text passwords.txt  my-repo.git

Your entire repository history will be scanned, and all .php files (under 1MB in size) will have the substitutions performed: any matching string (that isn't in your latest commit) will be replaced.

...no escaping needed

Note that the only parsing the BFG does here with the substitution file is to split each line on the '==>' string, which probably isn't in your passwords. All other text is interpreted literally by default.

If you want to be even more concise, you can drop the '==>' and everything that comes after it on each line (ie, just have a file of passwords) and The BFG will replace each password with the string '***REMOVED***' by default.
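If you already have a plain list of secrets and would rather set the replacement text yourself, the '==>' suffix can be appended mechanically. This is only a sketch: the file name strings.txt and the two sample secrets are made up for illustration, and it assumes no secret contains '==>' itself.

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical input: one secret per line, with no quoting or escaping.
printf '%s\n' 'PASSWORD1' 'ezxcdf\fr$sdd%' > strings.txt

# Append "==>xXxXxXxXxXx" to the end of every line; the secrets are not altered.
sed 's/$/==>xXxXxXxXxXx/' strings.txt > passwords.txt

cat passwords.txt
```

The resulting passwords.txt has the same shape as the example file above and can be passed to --replace-text.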

The BFG is typically hundreds of times faster than running git-filter-branch on a big repo and the options are tailored around these two common use-cases:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Roberto Tyley
  • `*.php` should be quoted, it seems: `'*.php'` – fooquency Sep 06 '13 at 09:37
  • On a quad-core machine, with 800 strings and a 500MB repository, this seems to be much slower than git-filter-branch, taking 5 minutes to get to 1%. Is that to be expected with this volume? – fooquency Sep 06 '13 at 09:52
  • @fooquency - thanks for that speed test, very interesting to me - literally the first time I've heard of the BFG being slower than git-filter-branch! It could be that 800 strings is quite a lot to match against- could you time a couple of test runs with an identical setup but just a single password, and then repeat with a few more? It's best to run from a fresh copy of the repo each time, so it's probably worth taking a zip of a fresh copy of your repo, and then unzipping a fresh copy for each run. I'll try to recreate the experiment with a repo of similar size (eg the linux kernel) – Roberto Tyley Sep 06 '13 at 10:09
  • Thanks. Am going to try this on a second copy of the repo, as the filter-branch method looks like it's going to take longer to complete than I thought, so this may well end up being quicker. – fooquency Sep 07 '13 at 10:15
  • I'm now running a parallel instance, and it seems to be about 10x faster. Two questions: is there an option to change the default replacement string, to avoid having to pre-process the strings file to add ==>xXxXxXxX rather than *** REMOVED *** (purely for aesthetic reasons)? Also, the "..contains n dirty files" check at the start is useful. But is there a way to make it list all of them (or limit to, say, 10, rather than just the first two), and moreover, to state the specific string that matched and the line number? One file has many strings, so it is hard to find what's been forgotten. – fooquency Sep 08 '13 at 00:03
  • @fooquency glad to hear it's going faster for you! Answers: 1) Currently there's no option to change the default for the replacement string from '*** REMOVED ***', and while it's a reasonable feature request, I'm somewhat biased towards restricting the number of command-line options in order to not overwhelm new users - if the user really cares that much about the aesthetics of their replacement string, it's not too much work for them to add the extra detail to the passwords file, as you did 2) Yup, that makes sense - probably logging out a full diff report to a separate file. – Roberto Tyley Sep 08 '13 at 07:14
  • sorry - just saying that 2) was a good idea, but there's no way to make the BFG do it yet - I'll work on it for a forthcoming release. – Roberto Tyley Sep 08 '13 at 07:22
  • Thanks. Yes, it completed much faster and has done the job. Thanks for your work on the software. – fooquency Sep 09 '13 at 00:03
  • Having an option e.g. --showunclean would be really useful - I'm having real difficulty identifying the strings in some of the files, and iterating (only two at a time) is slow. – fooquency Sep 15 '13 at 13:34
  • sorry it took so long to get back to you @fooquency - as of v1.11.0, The BFG now writes full reports on the 'dirty' files in your protected commits. The reports are written as CSV files, and line numbers within the affected files are included. – Roberto Tyley Oct 01 '13 at 21:24
1

Building on the brilliant help from konsolebox, which got me most of the way there, this is the shell-based solution I ended up using.

Define the strings in a file, strings.txt:

string1
another$string
yet! @nother string
some more stuff to re\move

Create a Perl script, perl-escape-strings.pl, which escapes each string and prints one substitution command per line, where xXxXxXxXxXx is the string they will all be replaced with:

#!/usr/bin/perl

use strict;
use warnings;

while (<>)
{
        chomp;
        my $passwd = quotemeta($_);
        print qq|s/$passwd/xXxXxXxXxXx/g;\n|;
}

exit 0;

Bash script:

# Pre-process the strings
./perl-escape-strings.pl strings.txt > strings-perl-escaped.txt

# Change directory to the repo
cd repo/

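# Define the filter command. The tree filter runs inside a temporary checkout
# (.git-rewrite/t by default), so refer to the Perl program by absolute path.
FILTER="git ls-files -z '*.html' '*.php' | xargs -0 perl -p -i '$(cd .. && pwd)/strings-perl-escaped.txt'"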

# Run the filter
git filter-branch --tree-filter "$FILTER" -- --all

However, because there are many strings and my repository is large, with many thousands of commits, the filter-branch method is taking a long time. So I'm also going to try The BFG, mentioned in another answer, in parallel to see whether it completes more quickly.

Community
fooquency
0

Using a wrapper script:

#!/bin/bash

readarray -t PASSWORDS < list_file

REPLACEMENT='xXxXxXxXxXx'
SEP=$'\xFF'

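# Escape one password for Perl's regex syntax and for the double-quoted
# shell string it ends up in. The backslash substitution must come first.
escape() {
    local S=$1
    S=${S//'\'/'\\\\'}; S=${S//'$'/'\\\$'}
    S=${S//'"'/'\"'};   S=${S//'`'/'\`'}
    S=${S//'^'/'\\^'};  S=${S//'['/'\\['}
    S=${S//']'/'\\]'};  S=${S//'+'/'\\+'}
    S=${S//'?'/'\\?'};  S=${S//'.'/'\\.'}
    S=${S//'*'/'\\*'};  S=${S//'{'/'\\{'}
    S=${S//'}'/'\\}'};  S=${S//'('/'\\('}
    S=${S//')'/'\\)'};  S=${S//'|'/'\\|'}
    S=${S//'@'/'\\@'}
    printf '%s' "$S"
}

# Escape each password before joining, so that the grouping parentheses
# and the | separators added below stay unescaped.
EXPR=$(escape "${PASSWORDS[0]}")
for (( I = 1; I < ${#PASSWORDS[@]}; ++I )); do
    EXPR+="|$(escape "${PASSWORDS[I]}")"
done
EXPR="s${SEP}(${EXPR})${SEP}$REPLACEMENT${SEP}g"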

FILTER="git ls-files -z '*.php' | xargs -0 perl -p -i -e \"$EXPR\""

echo "Number of passwords: ${#PASSWORDS[@]}"    
echo "Passwords:" "${PASSWORDS[@]}"
echo "EXPR: $EXPR"
echo "FILTER: $FILTER"

git filter-branch --tree-filter "$FILTER" -- --all
konsolebox
  • Thanks; am just trying now. Yes, quite a number of them do have ' . – fooquency Sep 05 '13 at 23:49
  • I think the regexp match half isn't quite right, because there's no | separator: `$ echo $FILTER` gives: `git ls-files -z '*.txt' | xargs -0 perl -p -i -e 's?(foo bar)?xXxXxXxXxXx?g'` – fooquency Sep 05 '13 at 23:54
  • Incidentally, the passwords are coming originally from a file, one per line, so presumably the $PASSWORDS assignment can be done simply using `PASSWORDS=(\`cat "/path/to/file"\`)` – fooquency Sep 05 '13 at 23:58
  • @fooquency I tried the script with just an echo on git's command I had an output like this: `Passwords: PASSWORD1 PASSWORD2 PASSWORD3 git filter-branch --tree-filter git ls-files -z '*.php' | xargs -0 perl -p -i -e "s�(PASSWORD1|PASSWORD2|PASSWORD3)�xXxXxXxXxXx�g" -- --all`. The passwords are separated well with `|`. The value of IFS variable does it. – konsolebox Sep 06 '13 at 00:01
  • @fooquency Do you mean the passwords are in a file, one per line? It's actually better that way. You could just use `readarray` to get them, and they don't need to be quoted. – konsolebox Sep 06 '13 at 00:02
  • Yes, the passwords are indeed one per line, unquoted. – fooquency Sep 06 '13 at 00:12
  • Hmm.. I'm definitely not seeing a pipe. If I do `echo $EXPR` after the first EXPR= line, I see each string, but with a space between each. I'm on an Ubuntu machine; perhaps there is something Ubuntu-specific going on here? – fooquency Sep 06 '13 at 00:13
  • @fooquency Is the script run through `bash script.sh`? – konsolebox Sep 06 '13 at 00:15
  • I've tried that, and I get the same: spaces shown. The password file does contain some lines with spaces in, since some are phrases that have to be cleared. Will that present a problem? – fooquency Sep 06 '13 at 00:17
  • Yes, but only if you don't use `readarray`. With `readarray`, the whole line is included as a password, even the leading and trailing spaces. Try using `readarray` instead. It might also change the output of the variable. If it still doesn't work, we'll join them manually with a loop. – konsolebox Sep 06 '13 at 00:20
  • Yes, I'm using readarray at the start. (NB I assume by trailing spaces you are not including the newline itself.) The echo `$PASSWORDS line` shows a very long string without linebreaks, only a space between. – fooquency Sep 06 '13 at 00:22
  • Incidentally, If I do, on the command line, `IFS='|'` then `echo $IFS` the result is a blank line. So it's definitely not being set. – fooquency Sep 06 '13 at 00:23
  • I made an update that doesn't depend on IFS. Please try again. I'm actually having an idea that perhaps the textfile is not in `\n` line endings? – konsolebox Sep 06 '13 at 00:24
  • It's possible that at some point they were edited on a Windows machine. I'm not sure. I've run dos2unix on it to make sure. – fooquency Sep 06 '13 at 00:27
  • By the way, the `echo "Passwords:" "${PASSWORDS[@]}"` line only separates the passwords with spaces, since that is not yet the formatted expression. I hope you have tried the new script already. If the file really does have CRLF line endings, you could just run dos2unix on it. – konsolebox Sep 06 '13 at 00:34
  • The debug echoing of the $EXPR line or $FILTER line definitely isn't showing the pipe character. However, if I breakpoint this just after the loop that adds the pipe, it is there. The pipe is being lost during `EXPR="s${SEP}(${PASSWORDS[*]})${SEP}$REPLACEMENT${SEP}g"` it seems. Ah, actually presumably that line should no longer have `(${PASSWORDS[*]})` in? – fooquency Sep 06 '13 at 00:56
  • Oh yes sorry I shouldn't have included that. I made the update. – konsolebox Sep 06 '13 at 01:01
  • Right, a clear sign of progress: I am seeing some string replacement now. However, if the source code string contains a $ in the middle, the subsequent characters remain. So for instance, if a password were abc$efg, after running the script, the source in the repo contains xXx$efg. Do you have any ideas on that? (Thank you enormously for your help so far - I have learnt a lot, even working past midnight here!) – fooquency Sep 06 '13 at 01:12
  • Probably we still need to quote `$` further with respect to its syntax with perl's command s. Try to change `EXPR=${EXPR//'$'/'\$'}` to `EXPR=${EXPR//'$'/'\\\$'}`. – konsolebox Sep 06 '13 at 01:19
  • Also, the script won't run (i.e. an invocation runs but doesn't complete, leaving an empty line on the terminal as if waiting for something else to be entered) if a string in the passwords file contains \ or $ . If I remove those lines, it will complete. Ah - will try your latest suggestion just now which crossed in the post. – fooquency Sep 06 '13 at 01:21
  • Perhaps you should do it with \\ as well: `EXPR=${EXPR//'\'/'\\\\'}`. Note that it should always be at the first of those substitution commands. – konsolebox Sep 06 '13 at 01:26
  • I've not put in that change yet. (Would you mind editing the main entry, so I can be sure I'm not editing the wrong thing? I've just spotted that what looked cryptic is actually not as bad as I thought - I see now it is several distinct commands; might be better to put these on separate lines to be clear). Anyway, I was writing to say: I've just tried the passwords file on itself. Most stuff is now being turned into xXxXxXxXxXx - which is good news. However, those strings with the following characters do not get wiped out: @[]()?/ which frankly is a rather familiar-looking list.. – fooquency Sep 06 '13 at 01:31
  • I think I see what those clauses are doing now. Hadn't come across that replacement syntax within a variable before.. – fooquency Sep 06 '13 at 01:38
  • Sorry for the late reply. I tried to analyze how the quoting should be done and mapped it out in an editor, imagining how it would affect the syntax Perl sees in the end. I hope it works this time. – konsolebox Sep 06 '13 at 02:09
  • Thank you again for your incredible persistence. I tried that, and it didn't quite work - that list of special characters was still appearing. However, I realised that I could use \Q...\E within each string. This now works for almost everything - only passwords with $ and @ are not being wiped out, and the script runs only if I remove those passwords with \ in. So it feels like I'm nearly there now. – fooquency Sep 06 '13 at 02:46
  • I ran out of ideas :) Testing a perl command here, this works for me: `echo '[]{}().*|@$?\' | perl -p -e 's:\[\]\{\}\(\)\.\*\|\@\$\?\\:works:'`. So I'm now confused about why my last method didn't work. I actually tried `\Q\E` as well, but I'm not sure how I'd really implement it, since some characters like '$' are not covered by it. But I'll try again later :) – konsolebox Sep 06 '13 at 03:56
  • With some help from a local Perl expert, we found that the solution was to pre-process the strings via a small Perl script first, to do all the escaping, then assemble them into a string, then run the filter command. – fooquency Sep 07 '13 at 09:55
  • @fooquency Good thing you solved it already :) Seems like depending on bash alone was next to impossible. – konsolebox Sep 07 '13 at 09:58
  • I've just posted the complete solution I used. However, your pointers were extremely helpful, and helped me understand this stuff much more clearly - so thanks again. – fooquency Sep 07 '13 at 10:14
  • @fooquency No worries. Welcome :) This thread could be helpful to me as well someday who knows. Added it to my favorites :) – konsolebox Sep 07 '13 at 10:23
0

Build it from the inside out. Say the password is

a$b'c\d

The regex pattern would be

a\$b'c\\d

One possibility for the perl command would be

perl -i -pe's/a\$b'\''c\\d/.../g'

(Note how each ' was replaced with '\''.)

Now you need to include that in single quotes, so you repeat the process.

... '... perl -i -pe'\''s/a\$b'\''\'\'''\''c\\d/.../g'\''' ...
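
The ' to '\'' replacement can be done mechanically instead of by hand. Below is a small sketch of that one step; sq_quote is a hypothetical helper name, not part of any tool mentioned here, and it only handles the shell-quoting layer (the regex escaping still has to be applied first).

```shell
# Hypothetical helper: wrap a string in single quotes for the shell,
# replacing every embedded ' with '\'' as described above.
sq_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

pw='a$b'\''c\d'              # the example password a$b'c\d
quoted=$(sq_quote "$pw")
printf '%s\n' "$quoted"      # prints 'a$b'\''c\d'

# Round trip: letting the shell parse the quoted form gives the password back.
eval "back=$quoted"
[ "$back" = "$pw" ] && echo "round trip ok"
```

Applying sq_quote once per nesting level reproduces the "repeat the process" step: each extra layer of single quotes is one more call.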
ikegami