I need to find and replace keywords from a hash in large documents as fast as possible. I tried the two methods below; one is 320% faster than the other, but I suspect I am still doing this the wrong way and that there is a better approach.
The idea is to replace only the keywords that exist in the dictionary hash and to leave the ones that do not exist untouched, so I can tell they are not in the dictionary.
As far as I can tell, both methods below scan the text twice to find and replace. I suspect a regex feature such as lookahead or lookbehind could make this much faster.
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);
my %dictionary = (
    pollack => "pollard",
    polynya => "polyoma",
    pomaces => "pomaded",
    pomades => "pomatum",
    practic => "praetor",
    prairie => "praised",
    praiser => "praises",
    prajnas => "praline",
    quakily => "quaking",
    qualify => "quality",
    quamash => "quangos",
    quantal => "quanted",
    quantic => "quantum",
);
my $content =qq{
Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.
};
# double the content to make the test input larger
$content .= $content;
cmpthese(100000, {
    replacer_1 => sub {my $text = replacer1($content)},
    replacer_2 => sub {my $text = replacer2($content)},
});
print replacer1($content) , "\n--------------------------\n";
print replacer2($content) , "\n--------------------------\n";
exit;
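# Method 1: a single s///ge; each {keyword} is looked up in the hash
# and either replaced with [value] or put back unchanged.
sub replacer1 {
    my ($content) = shift;
    $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
    return $content;
}
# Method 2: collect every {keyword} first, then run one substitution
# per collected name that exists in the hash.
sub replacer2 {
    my ($content) = shift;
    my @names = $content =~ /\{(.+?)\}/g;
    foreach my $name (@names) {
        if (exists $dictionary{$name}) {
            $content =~ s/\{$name\}/\[$dictionary{$name}\]/;
        }
    }
    return $content;
}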
Here is the benchmark result:
              Rate replacer_2 replacer_1
replacer_2  5565/s         --       -76%
replacer_1 23397/s       320%         --
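For comparison, here is a minimal sketch of a third variant that could be added to the benchmark. It assumes the dictionary does not change after start-up, so the keys can be compiled once into a single alternation; the pattern then only matches keywords that exist, and unknown ones such as {doesNotExist} are never touched. The name replacer3 and the two-entry sample hash are hypothetical, and I have not measured whether this is actually faster than replacer1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-entry sample; the real script would use the full %dictionary.
my %dictionary = (
    pollack => "pollard",
    quantal => "quanted",
);

# Build the alternation once: quote each key and try longer keys first.
my $keys = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } keys %dictionary;
my $re = qr/\{($keys)\}/;

# Single pass: only {keywords} present in the hash can match,
# so no exists check is needed and unknown keywords stay as they are.
sub replacer3 {
    my ($content) = @_;
    $content =~ s/$re/[$dictionary{$1}]/g;
    return $content;
}

print replacer3("A {quantal} and {pollack} but {doesNotExist}."), "\n";
# prints: A [quanted] and [pollard] but {doesNotExist}.
```

To compare it with the other two, an entry like `replacer_3 => sub {my $text = replacer3($content)}` could be added to the cmpthese call.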