Regex to replace pairs %s, \s correctly even with %% and \\

Question

I would like to find a regex that replaces %% with % and %s with my custom string foobar. This is trickier than it sounds, because it should turn %%s into %s and not %foobar, so this naive implementation does not work:

s/%%/%/g
s/%s/foobar/g

This problem is quite common and I've come across it multiple times in my programming life. Not just percent s or percent escaping, but also the backslash character and backslash escaping. I'm going to post my usual solution but I'm wondering if there's a better way.

(Allow me to do some keyword stuffing for my future searches: character pairs, backslash backslash, backslash x, percent percent, percent s. Thank you.)

If there are specific language features that would help in this use-case, I'd be interested in hearing what they are.

Example input and output:

input  : test %%, test %s, test %%s too
output : test %, test foobar, test %s too

Another one:

input  : test%%,test%s,test%%stoo
output : test%,testfoobar,test%stoo

keyword stuffing ? You could just star your own question for easy retrieval... as it stands, the offending paragraph will most likely be deleted without notice. — SirDarius, Feb 03 '16 at 15:28
@SirDarius I hope not because I am genuinely trying to be helpful to Googlers everywhere, including myself, it has happened to me that I've found my own answers on SO after searching on Google for a solution that I've forgotten about. Also, searching for %s %% \\ and so on is a *pain* on Google. — Flimm, Feb 03 '16 at 15:33
Language? And do your symbols always have e.g. empty spaces around them? Because matching against `\b%%\b` would accomplish that. e.g. do you have to ever match `%%someword` or `stuff%%morestuff`? — Sobrique, Feb 03 '16 at 15:41
Then it means that the problem is not correctly enunciated, because it might apply to any character, not only '\', '%'. Your litteral characters will not help people searching for a solution for a similar problem with, say `$$` :) — SirDarius, Feb 03 '16 at 15:41
Similar/Dupe [Replace ,(comma) by .(dot) and .(dot) by ,(comma)](http://stackoverflow.com/questions/34238005/replace-comma-by-dot-and-dot-by-comma) The base problem is same, once first thing is replaced, second `replace()` will override the first replaced string. — Tushar, Feb 03 '16 at 15:47
@Sobrique: no, the symbols do not always have empty spaces around them. `test%%test%stest%%stoo` should turn into `test%testfoobartest%stoo`. — Flimm, Feb 03 '16 at 15:58
Can I suggest some example input/output would help this question? — Sobrique, Feb 03 '16 at 15:59

anubhava · Answer 1 · 2016-02-03T15:46:32.657

This can be simplified and done in a single replace call:

var str = "test %s, test %%, test %%s too";
var output = str.replace(/%%|(%s)/g, function($0, $1){
     return $1!==undefined?'foobar':'%'; });
//=> test foobar, test %, test %s too

We use alternation, /%%|(%s)/, with a capturing group around %s. In the replace callback, $1 !== undefined tells us that the %s branch matched, so we return 'foobar'; otherwise %% matched and we return '%'.

Sobrique · Answer 2 · 2016-02-03T16:10:52.680

The thing with regular expressions is - if you run them twice, they get applied twice.

So yes - your implementation isn't going to work, because you 'search twice' - after your first replace, you have no way to tell the difference between a % that came from %% and one that was there all along.

So how about this instead:

#!/usr/bin/env perl

use strict;
use warnings;

my %replace = ( '%%' => '%',  
                '%s' => 'foobar' );

my $search = join ( "|", keys %replace );
   $search = qr/($search)/; 

print "Search regex: $search\n";
while ( <DATA> ) {
   s/$search/$replace{$1}/g;
   print;
}

##output : test %, test foobar, test %s too
##output : test%,testfoobar,test%stoo

__DATA__
test %%, test %s, test %%s too
test%%,test%s,test%%stoo

That's doing it perlishly, but you're building a lookup table - capturing the left hand side, and looking up what it should replace with on the right. (You can turn this into a one liner too).

Output:

Search regex: (?^:(%%|%s))
test %, test foobar, test %s too
test%,testfoobar,test%stoo

Pretty sure you should be able to implement this in most languages.
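For instance, here is a rough JavaScript sketch of the same lookup-table idea. The names escapeRegExp and replaceAllPairs are made up for this example, and the backslash table is only an assumption about what the question's \s and \\ case would look like:

```javascript
// Hypothetical helper: escape regex metacharacters in a literal key.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: replace every key of `table` in a single pass.
function replaceAllPairs(input, table) {
    // Longest keys first, so a key that is a prefix of another cannot shadow it.
    var keys = Object.keys(table).sort(function (a, b) {
        return b.length - a.length;
    });
    var pattern = new RegExp(keys.map(escapeRegExp).join('|'), 'g');
    return input.replace(pattern, function (match) {
        return table[match];
    });
}

var percentTable = { '%%': '%', '%s': 'foobar' };
console.log(replaceAllPairs('test %%, test %s, test %%s too', percentTable));
// test %, test foobar, test %s too

// Assumed backslash variant: \\ becomes \ and \s becomes foobar.
var backslashTable = { '\\\\': '\\', '\\s': 'foobar' };
console.log(replaceAllPairs('a \\\\ b \\s c \\\\s', backslashTable));
// a \ b foobar c \s
```

Escaping the keys matters as soon as the escape character is a backslash or any other regex metacharacter.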

As an alternative, it's probably worth considering regex lookaround, which lets you keep two separate substitutions as long as you apply them in the opposite order:

#!/usr/bin/env perl

use strict;
use warnings;

while ( <DATA> ) {
   s/(?<!%)%s/foobar/g;
   s/%%/%/g;
   print;
}

##output : test %, test foobar, test %s too
##output : test%,testfoobar,test%stoo

__DATA__
test %%, test %s, test %%s too
test%%,test%s,test%%stoo

(?<!%) is a zero-width assertion that says 'not preceded by a percent', so the first substitution replaces just %s with "foobar" (but ignores %%s). The second substitution then collapses %% to %; it cannot touch the inserted 'foobar', because that text has no %% in it.

Output:

test %, test foobar, test %s too
test%,testfoobar,test%stoo

The downside of this approach is that not all languages properly support lookaround, and lookbehind in particular. (It's an 'advanced regex' feature, not a 'basic' one.)
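One case the test data above does not cover is an escaped percent immediately followed by a placeholder, as in %%%s. Here is a small JavaScript sketch of the difference, assuming an engine with lookbehind support (ES2018 or later); twoPass and singlePass are names invented for this comparison:

```javascript
// Two-pass lookbehind version, translated to JavaScript.
function twoPass(input) {
    return input.replace(/(?<!%)%s/g, 'foobar').replace(/%%/g, '%');
}

// Single-pass alternation, for comparison.
function singlePass(input) {
    return input.replace(/%%|%s/g, function (match) {
        return match === '%%' ? '%' : 'foobar';
    });
}

console.log(twoPass('100%%%s'));    // 100%%s
console.log(singlePass('100%%%s')); // 100%foobar
```

The lookbehind only sees that %s has a % in front of it. It cannot tell that this % is itself the second half of %%, so the placeholder is skipped. Both versions agree on the inputs given in the question.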

Why did you decide to edit out the lookaround solution? That was interesting and unlike the other answers, and I upvoted it. — Flimm, Feb 03 '16 at 16:08
Re-editing it in, based on test data. Realised I'd made a silly mistake (omitted the 'g' flag) — Sobrique, Feb 03 '16 at 16:09

score 1 · Answer 3 · answered Feb 03 '16 at 16:57

The general problem of escape sequences is not optimally solved by regular expression substitution.

You have to think of your string as a sequence of tokens evaluated lexically by a state machine.

You start by being in a NORMAL state.
In the normal state, any character that you encounter is copied-as-is to the output, unless it is a %, in which case you enter a state PERCENT.
In that state, you can encounter a %, then you output % and return to NORMAL.
You can also encounter an s, and then you pop the next substitution string, output it, and return to NORMAL.
Finally, depending on the behavior you need, any other character encountered in the PERCENT state can yield an error, or be ignored...

Example javascript code:

function parseString(s, vars) {
    var NORMAL = 0, PERCENT = 1;

    var state = NORMAL;
    var varidx = 0;
    var output = '';
    for (var i = 0; i <  s.length; i++) {
        if (state == NORMAL) {
            if (s[i] == '%') {
                state = PERCENT;
            } else {
                output += s[i];
            }
        } else if (state == PERCENT) {
            if (s[i] == '%') {
                output += s[i];
                state = NORMAL;
            } else if (s[i] == 's') {
                output += vars[varidx++];
                state = NORMAL;
            } else {
                throw new Error('Invalid syntax');
            }
        }
    }
    return output;
}

Example:

parseString("test %%, test %s, test %%s too", ['foo']);
// returns "test %, test foo, test %s too"

While this approach takes more code than the regexp-based solutions, it is probably faster, because a general-purpose regular expression engine does more work than this simple loop needs, and it allows you to handle invalid syntax in whatever way fits you best.
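For comparison, the same strictness can be approximated with a single regex pass. This is only a sketch: expandStrict is a made-up name, and treating a lone % at the end of the string as an error is an assumption (the loop above leaves that case unreported):

```javascript
// Hypothetical helper: strict single-pass expansion with a regex.
function expandStrict(input, values) {
    var index = 0;
    // Match the escape character plus the next character, if there is one.
    return input.replace(/%([\s\S]?)/g, function (match, next) {
        if (next === '%') {
            return '%';
        }
        if (next === 's') {
            if (index >= values.length) {
                throw new Error('Not enough values');
            }
            return values[index++];
        }
        // Covers unknown sequences such as %d and a trailing lone %.
        throw new Error('Invalid sequence "' + match + '"');
    });
}

console.log(expandStrict('test %%, test %s, test %%s too', ['foo']));
// test %, test foo, test %s too
```

Whether this or the explicit state machine is easier to maintain is a matter of taste; the state machine makes each transition visible.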

score 0 · Answer 4 · answered Feb 03 '16 at 15:25

This is one easy way of doing it. Split the string into an array of chunks, for lack of a better term, using the regex /%%|%s|./, so that each chunk is either a single character or an escaped character. Then check each chunk for %s and %%, do the unescaping, and join the array again, like this:

Input  : "test %s, test %%, test %%s too"
Array  : ["t", "e", "s", "t", " ", "%s", ",", " ", "t", "e", "s", "t", " ", "%%",
          ",", " ", "t", "e", "s", "t", " ", "%%", "s", " ", "t", "o", "o"]
Output : "test foobar, test %, test %s too"

Here is the same idea in Javascript without using a variable to hold the chunks:

var str = "test %s, test %%, test %%s too";
var output = str.replace(/%%|%s|./g, function(match) {
  return match.replace("%%", "%").replace("%s", "foobar");
});
console.log("output:", output);

score 0 · Answer 5 · edited May 23 '17 at 11:52


Here's a JavaScript way to replace %% with % and %s with foobar.

string.replace(/%%|%s/g, function (match) {
    // If %% is matched, replace it by %
    // else %s is matched, replace by `foobar`
    return match === '%%' ? '%' : 'foobar';
});

var str = "test %s, test %%, test %%s too";

str = str.replace(/%%|%s/g, function (_) {
    return _ === '%%' ? '%' : 'foobar';
});

console.log(str);
document.body.innerHTML = str;

Using this approach, String#replace is used only once, instead of three times as in other answers.

Here's another approach, using the same logic as swapping two variables through a temporary variable. Similar to this answer by @torazaburo

str
    .replace(/%%/g, '2percent') // Replace the first string with a temp string that should not occur in the main string
    .replace(/%s/g, 'foobar') // Replace second string
    .replace(/2percent/g, '%'); // Replace temp by the normal string

var str = "test %s, test %%, test %%s too";

str = str
    .replace(/%%/g, '2percent')
    .replace(/%s/g, 'foobar')
    .replace(/2percent/g, '%');

console.log(str);
document.body.innerHTML = str;
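If a fixed temp string like 2percent feels too risky, one option is to pick a character that provably does not occur in the input or in the replacement. This is a sketch only: pickPlaceholder and swapWithPlaceholder are invented names, and starting the search in the Unicode private use area is an arbitrary choice:

```javascript
// Hypothetical helper: find a single character that does not occur in `text`.
function pickPlaceholder(text) {
    // Walk the Unicode private use area (assumed range for this sketch).
    for (var code = 0xE000; code <= 0xF8FF; code++) {
        var candidate = String.fromCharCode(code);
        if (text.indexOf(candidate) === -1) {
            return candidate;
        }
    }
    throw new Error('No free placeholder character found');
}

function swapWithPlaceholder(input, replacement) {
    // The placeholder must be absent from both the input and the replacement.
    var temp = pickPlaceholder(input + replacement);
    return input
        .split('%%').join(temp)         // protect escaped percents
        .split('%s').join(replacement)  // expand placeholders
        .split(temp).join('%');         // restore protected percents
}

console.log(swapWithPlaceholder('2percent %%, test %s, test %%s too', 'foobar'));
// 2percent %, test foobar, test %s too
```

A single character that is absent from the text cannot be confused with anything already there. That costs an extra scan of the input, so the single-replace version is still simpler.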

Tushar · answered Feb 03 '16 at 15:37 · edited May 23 '17 at 11:52 (Community)

Sorry just noticed your answer after posting mine. I will leave it for now as there is some difference in approaches (checking for matched string vs checking for presence of a captured group) – anubhava Feb 03 '16 at 15:43
@anubhava No prob. and no need to say sorry. – Tushar Feb 03 '16 at 15:45
The first approach looks good but the second approach would fail in corner-cases where the input string has `2percent` in it. – Flimm Feb 03 '16 at 15:55
@Flimm That's what I've said in the comments, use the temp string which is less likely present in the actual string. – Tushar Feb 03 '16 at 15:57
