Regex simple replace document from dictionary hash (Perl)

Question

I need to find and replace keywords from a hash in a large documents as fast as possible. I tired the below two methods, one is faster by 320% but I am sure I am doing this the wrong way and sure there is a better way to do it.

The idea I want to replace only the keywords that exist in the dictionary hash and keep those that does not exist so I know it is not in the dictionary.

Both methods below scan twice to find and replace as I think. I am sure the regex like look ahead or behind can optimize it much faster.

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(:all);

my %dictionary = (
            pollack => "pollard",
            polynya => "polyoma",
            pomaces => "pomaded",
            pomades => "pomatum",
            practic => "praetor",
            prairie => "praised",
            praiser => "praises",
            prajnas => "praline",
            quakily => "quaking",
            qualify => "quality",
            quamash => "quangos",
            quantal => "quanted", 
            quantic => "quantum",
    );

my $content =qq{
        Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that 
        can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
        Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
        one kind of problem {qualify} {doesNotExist} end.
    };

# just duplicate content many times
$content .= $content;

cmpthese(100000, {
    replacer_1 => sub {my $text = replacer1($content)},
    replacer_2 => sub {my $text = replacer2($content)},
});

print replacer1($content) , "\n--------------------------\n";
print replacer2($content) , "\n--------------------------\n";
exit;

sub replacer1 {
    my ($content) = shift;
    $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
    return $content;
}

sub replacer2 {
    my ($content) = shift;
    my @names = $content =~ /\{(.+?)\}/g;
    foreach my $name (@names) {
        if (exists $dictionary{$name}) {
            $content =~ s/\{$name\}/\[$dictionary{$name}\]/;
        }
    }
    return $content;
}

Here is the benchmark result:

              Rate replacer_2 replacer_1
replacer_2  5565/s         --       -76%
replacer_1 23397/s       320%         --

mob · Answer 1 · 2014-08-02T19:26:32.353

3

Here's a way that's a little faster and more compact:

sub replacer3 {
    my ($content) = shift;
    $content =~ s#\{(.+?)\}#"[".($dictionary{$1} // $1)."]"#ge;
    return $content;
}

In Perl 5.8, it is ok to use || instead of // if none of your dictionary values are "false".

There's also a little to be gained by using a dictionary that already contains the braces and brackets:

sub replacer5 {
    my ($content) = shift;
    our %dict2;
    if (!%dict2) {
        %dict2 = map { "{".$_."}" => "[".$dictionary{$_}."]" } keys %dictionary
    }
    $content =~ s#(\{.+?\})#$dict2{$1} || $1#ge;
    return $content;
}

Benchmark results:

              Rate replacer_2 replacer_1 replacer_3 replacer_5
replacer_2  2908/s         --       -79%       -83%       -84%
replacer_1 14059/s       383%         --       -20%       -25%
replacer_3 17513/s       502%        25%         --        -7%
replacer_5 18741/s       544%        33%         7%         --

edited Aug 02 '14 at 19:26

answered Aug 02 '14 at 18:53

mob

117,087
18
149
283

This gets rid of the braces. – NigoroJr Aug 02 '14 at 18:59
My bad, and that's an easy fix -- let's see if it has a large effect on the time. – mob Aug 02 '14 at 19:05
good but the defined or operator `//` is not supported in Perl 5.8.x, I need to support those versions too. – daliaessam Aug 02 '14 at 19:07
@mob +1 Either way, it's amazing how much difference `//` makes from the ternary operator... – NigoroJr Aug 02 '14 at 19:14
@daliaessam, If you knew ahead of time the string was a word character for example you *might* try `\w{1,}` instead of `.+?` to squeeze a *tiny* bit more speed out of this, although probably not noticeable. @mob — nice answer! – l'L'l Aug 02 '14 at 20:10
You actually want `\{([{}]+)\}` instead of `\{(.+?)\}` in case there's an unmatched `{` or `}`. – ikegami Aug 02 '14 at 22:17
@l'L'l the `\w{1,}` yes makes a very small difference. – daliaessam Aug 02 '14 at 23:51
@ikegami this `\{([{}]+)\}` does not work, it does not find matches. – daliaessam Aug 02 '14 at 23:53
Oops, typo `\{([^{}]+)\}` – ikegami Aug 03 '14 at 02:34

score 3 · Answer 2 · answered Aug 02 '14 at 19:29

3

It helps to build a regex that will match any of the hash keys beforehand. Like this

my $pattern = join '|', sort {length $b <=> length $a } keys %dictionary;
$pattern = qr/$pattern/;

sub replacer4 {
    my ($string) = @_;
    $string =~ s# \{ ($pattern) \} #"[$dictionary{$1}]"#gex;
    $string;
}

with these results

              Rate replacer_2 replacer_1 replacer_3 replacer_4
replacer_2  4883/s         --       -80%       -84%       -85%
replacer_1 24877/s       409%         --       -18%       -22%
replacer_3 30385/s       522%        22%         --        -4%
replacer_4 31792/s       551%        28%         5%         --

It would also make an improvement if you could the braces and brackets in the hash, instead of having to add them each time.

answered Aug 02 '14 at 19:29

Borodin

126,100
9
70
144

(The sorting isn't necessary here because of the braces.) – ikegami Aug 02 '14 at 22:21
the idea is good, but practically the hash is very long, millions of entries. – daliaessam Aug 02 '14 at 23:31
@daliaessam: That's not a reason to dismiss it, but I hope you are running 64-bit Perl? I'm not at home at present and have only a 32-bit Perl handy. A simple array of two-million random ten-character "words" takes 14 seconds to create (in memory) and 64 seconds to compile as a regex pattern. So you have a *very* fast string recognition machine that is likely to be much faster to create compared to just reading your dictionary from disk. – Borodin Aug 03 '14 at 00:15
1

@daliaessam, Re "the idea is good, but practically the hash is very long, millions of entries", Since 5.10, alternations are implemented using a trie, so that doesn't matter. – ikegami Aug 03 '14 at 02:35

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

I'd recommend using meaningful names for your benchmarking subroutines, it'll make the output and intent more clear.

The following reproduces a bit of what Borodin and mob have tried out, and then combines them as well.

#!/usr/bin/perl

use strict;
use warnings;
use feature 'state';

use Benchmark qw(:all);

# Data separated by paragraph mode.
my %dictionary = split ' ', do {local $/ = ''; <DATA>};
my $content = do {local $/; <DATA>};

# Quadruple Content
$content = $content x 4;

cmpthese(100_000, {
    original        => sub { my $text = original($content) },
    build_list      => sub { my $text = build_list($content) },
    xor_regex       => sub { my $text = xor_regex($content) },
    list_and_xor    => sub { my $text = list_and_xor($content) },
});

exit;

sub original {
    my $content = shift;
    $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
    return $content;
}

sub build_list {
    my $content = shift;
    state $list = join '|', map quotemeta, keys %dictionary;
    $content =~ s/\{($list)\}/[$dictionary{$1}]/gx;
    return $content;
}

sub xor_regex {
    my $content = shift;

    state $with_brackets = {
        map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
    };

    $content =~ s{(\{.+?\})}{$with_brackets->{$1} // $1}gex;

    return $content;
}

sub list_and_xor {
    my $content = shift;

    state $list = join '|', map quotemeta, keys %dictionary;
    state $with_brackets = {
        map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
    };

    $content =~ s{(\{(?:$list)\})}{$with_brackets->{$1} // $1}gex;

    return $content;
}

__DATA__
pollack pollard
polynya polyoma
pomaces pomaded
pomades pomatum
practic praetor
prairie praised
praiser praises
prajnas praline
quakily quaking
qualify quality
quamash quangos
quantal quanted 
quantic quantum

Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that 
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.

Outputs:

                Rate     original    xor_regex   build_list list_and_xor
original     19120/s           --         -23%         -24%         -29%
xor_regex    24938/s          30%           --          -1%          -8%
build_list   25253/s          32%           1%           --          -7%
list_and_xor 27027/s          41%           8%           7%           --

My solutions make heavy use of state variables to avoid reinitializing static data structures. However, one could also use closures or our $var; $var ||= VAL.

Addendum about enhancing the LHS of the regex

Actually, editing the LHS to use an explicit list is about improving the regular expression. And this change showed a 30% improvement in speed.

There isn't likely to be any magic solution to this. You have a list of values, that you're wanting to replace. It isn't like there is some mysterious way to simplify the language of this goal.

You could perhaps use a code block in the LHS to Fail and skip if the word does not exist in the dictionary hash. However, the following shows that this is actually 36% slower than your original method:

sub skip_fail {
    my $content = shift;

    $content =~ s{\{(.+?)\}(?(?{! $dictionary{$1}})(*SKIP)(*FAIL))}{[$dictionary{$1}]}gx;

    return $content;
}

Outputs:

                Rate   skip_fail    original   xor_regex build_list list_and_xor
skip_fail     6769/s          --        -36%        -46%       -49%         -53%
original     10562/s         56%          --        -16%       -21%         -27%
xor_regex    12544/s         85%         19%          --        -6%         -14%
build_list   13355/s         97%         26%          6%         --          -8%
list_and_xor 14537/s        115%         38%         16%         9%           --

all solutions does optimize the code but not the regex itself. what I am looking for is, if any way to advance the the regex if the match is not found in the dictionary array. to say, if match found, replace it, if not found advance the search pointer. can this be done. — daliaessam, Aug 04 '14 at 06:26
Short answer, already showed the likely best possible improvements. Longer answer, read the addendum. — Miller, Aug 05 '14 at 00:32
The skip fail method you posted is what I was thinking as a solution, of course the result is disappinting, can you point me to the regex docs about these (*SKIP)(*FAIL) etc, I only know the very basic regex stuff. — daliaessam, Aug 05 '14 at 00:43
Check out [`perlretut` - Backtracking Control Verbs](http://perldoc.perl.org/perlretut.html#Backtracking-control-verbs) and [`perlre` - Special Backtracking Control Verbs](http://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs). It also uses [Conditional Expressions](http://perldoc.perl.org/perlretut.html#Conditional-expressions) and [Code Blocks](http://perldoc.perl.org/perlretut.html#A-bit-of-magic%3a-executing-Perl-code-in-a-regular-expression) — Miller, Aug 05 '14 at 00:47
just to let you know, using exists instead of ! in the skip fail regex reduced the speed difference to 11% only instead of 36%, I think the exists test is faster as the ! checks if the variable is set of not, so it is good alternative. also thanks for the links, very useful, the only downside for these new extensions they work with Perl 5.10 and up only. — daliaessam, Aug 05 '14 at 01:52

Regex simple replace document from dictionary hash (Perl)

3 Answers3

Addendum about enhancing the LHS of the regex