More Efficient Way to Find/Replace Non-Escaped Characters

Question

I'm looking for the best way (in Ruby 1.9.2) to find and replace every instance of a special code (%x) that is preceded by zero or an even number of backslashes.

In other words:

%x      -->   FOO
\%x     -->   \%x
\\%x    -->   \\FOO
\\\%x   -->   \\\%x
\\\\%x  -->   \\\\FOO
etc.

There may be multiple instances in a string: "This is my %x string with two %x codes."

With help from the questions asked here and here, I got the following code to do what I want:

str.gsub(/
  (?<!\\)           # Not preceded by a single backslash
  ((?:\\\\)*)       # Eat up any sets of double backslashes - match group 1
  (%x)              # Match the code itself - match group 2
  /x,
  # Keep the double backslashes (match group 1), then put in the substitution
  "\\1foo")

That regex seems kind of heavyweight, though. Since this code will be called with reasonable frequency in my application, I want to make sure I'm not missing a better (cleaner/more efficient) way to do this.
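For reference, here is a minimal sketch that runs the regex above against the cases listed earlier. The helper name expand_code and the constant names are made up for this example, and the replacement is written as FOO to match the table. The inputs are built programmatically so the number of backslashes is unambiguous.

```ruby
# Regex from the question, unchanged apart from layout.
UNESCAPED_CODE = /
  (?<!\\)        # not preceded by a single backslash
  ((?:\\\\)*)    # any pairs of backslashes, kept as group 1
  (%x)           # the code itself, group 2
/x

# Hypothetical helper, for this example only.
def expand_code(str)
  str.gsub(UNESCAPED_CODE, "\\1FOO")
end

BS = "\\" # a single backslash character

# Check the code preceded by zero to four backslashes.
(0..4).each do |n|
  input    = BS * n + "%x"
  expected = n.even? ? BS * n + "FOO" : input
  actual   = expand_code(input)
  status   = actual == expected ? "ok" : "MISMATCH"
  puts "#{input.ljust(8)} --> #{actual.ljust(8)} #{status}"
end

# Multiple codes in one string.
puts expand_code("This is my %x string with two %x codes.")
```

An odd number of backslashes leaves the input untouched because the look-behind rejects every start position that directly follows a backslash.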

score 1 · Accepted Answer · answered Sep 28 '12 at 14:49

I can imagine two alternative regular expressions:

Using a look-behind assertion, as in your code (look-behind-2).
Matching one more character before the backslashes (alternative).

Other than that, I see only one minor optimization for your regular expression: the "%x" is constant, so you do not have to capture it (look-behind-1).
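As a rough Ruby sketch of the three variants (the constant and helper names are invented for this example), the following checks that they agree on the inputs from the question. It assumes \A as the Ruby counterpart of Perl's ^ without /m, since ^ in Ruby matches at every line start.

```ruby
# The three variants, ported to Ruby.
LOOK_BEHIND_1 = /(?<!\\)((?:\\\\)*)%x/          # "%x" not captured
LOOK_BEHIND_2 = /(?<!\\)((?:\\\\)*)(%x)/        # "%x" captured, as in the question
ALTERNATIVE   = /((?:\A|[^\\])(?:\\\\)*)%x/     # consumes one preceding character

VARIANTS = {
  "look-behind-1" => LOOK_BEHIND_1,
  "look-behind-2" => LOOK_BEHIND_2,
  "alternative"   => ALTERNATIVE
}

# Hypothetical helper, for this example only.
def replace_with(pattern, str)
  str.gsub(pattern, "\\1foo")
end

BS = "\\" # a single backslash character
samples = (0..4).map { |n| BS * n + "%x" }
samples << "This is my %x string with two %x codes."

samples.each do |s|
  results = VARIANTS.map { |_, re| replace_with(re, s) }
  verdict = results.uniq.size == 1 ? "all agree" : "variants differ"
  puts "#{s.inspect}: #{verdict}"
end

# Adjacent codes: the alternative has already consumed the character
# it would need in front of the second code.
puts replace_with(LOOK_BEHIND_1, "%x%x")  # foofoo
puts replace_with(ALTERNATIVE, "%x%x")    # foo%x
```

One thing to keep in mind: because the alternative consumes the character before the backslashes, it is not a drop-in replacement when two codes can appear directly next to each other.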

I am not sure which of these is actually more efficient. Therefore, I created a small benchmark:

$ perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $test = '%x \%x \\%x \\\%x \\\\%x \\\\\%x \\\\%x \\\%x \\%x \%x %x';

cmpthese 1_000_000, {
    'look-behind-1' => sub { (my $t = $test) =~ s/(?<!\\)((?:\\\\)*)\%x/${1}foo/g },
    'look-behind-2' => sub { (my $t = $test) =~ s/(?<!\\)((?:\\\\)*)(\%x)/${1}foo/g },
    'alternative'   => sub { (my $t = $test) =~ s/((?:^|[^\\])(?:\\\\)*)\%x/${1}foo/g },
};

Results:

                  Rate   alternative look-behind-2 look-behind-1
alternative   145349/s            --          -23%          -26%
look-behind-2 188324/s           30%            --           -5%
look-behind-1 197239/s           36%            5%            --

As the results show, the alternative regular expression is clearly slower than the look-behind approach (by roughly a quarter), and capturing the "%x" is slightly slower (about 5%) than not capturing it.
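Since the numbers above come from Perl's regex engine, it may be worth repeating the measurement in Ruby before drawing conclusions. Below is a minimal sketch using the standard Benchmark module. The iteration count, the names, and the test string (codes preceded by 0 to 5 backslashes, built programmatically) are my own assumptions, not a port of the exact Perl input.

```ruby
require "benchmark"

BS = "\\" # a single backslash character

# Assumed test input: "%x" preceded by 0..5 backslashes, separated by spaces.
TEST = (0..5).map { |n| BS * n + "%x" }.join(" ")

PATTERNS = {
  "look-behind-1" => /(?<!\\)((?:\\\\)*)%x/,
  "look-behind-2" => /(?<!\\)((?:\\\\)*)(%x)/,
  "alternative"   => /((?:\A|[^\\])(?:\\\\)*)%x/
}

ITERATIONS = 50_000 # arbitrary; raise it for more stable numbers

# Time each pattern; gsub returns a new string, so TEST is never modified.
timings = PATTERNS.map do |name, re|
  seconds = Benchmark.realtime do
    ITERATIONS.times { TEST.gsub(re, "\\1foo") }
  end
  [name, seconds]
end

# Print substitutions per second, slowest first.
timings.sort_by { |_, seconds| -seconds }.each do |name, seconds|
  puts format("%-14s %10.0f/s", name, ITERATIONS / seconds)
end
```

The ranking may well differ from the Perl results, so treat the output on your own Ruby version as the deciding factor.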

regards, Matthias
