How can I efficiently handle multiple Perl search/replace operations on the same string?

Question

So my Perl script basically takes a string and then tries to clean it up by doing multiple search and replaces on it, like so:

$text =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/[\(\{\[]\d+[\(\{\[]/ /g;
$text =~ s/\s+[<>]+\s+/\. /g;
$text =~ s/\s+/ /g;
$text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; # replace . **** Begin or . #### Begin or ) *The 
$text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; # . (blah blah) S... => . S...

As you can see, I'm dealing with nasty html and have to beat it into submission.

I'm hoping there is a simpler, aesthetically appealing way to do this. I have about 50 lines that look just like what is above.

I have solved one version of this problem by using a hash where the key is the comment and the value is the regular expression, like so:

%rxcheck = (
    'time of day' => '\d+:\d+',
    'starts with capital letters then a capital word' => '^([A-Z]+\s)+[A-Z][a-z]',
    'ends with a single capital letter' => '\b[A-Z]\.',
);

And this is how I use it:

foreach my $key (keys %rxcheck) {
    if ($snippet =~ /$rxcheck{$key}/g) { blah blah }
}

The problem comes up when I try my hand at a hash where the key is the expression and the value is what I want to replace it with... and there is a $1 or $2 in it.

%rxcheck2 = (
    '(\w) \"' => '$1\"',
);

The above is to do this:

$snippet =~ s/(\w) \"/$1\"/g;

But I can't seem to pass the "$1" part into the regex literally (I think that's the right word... it seems the $1 is being interpreted even though I used ' marks.) So this results in:

if($snippet =~ /$key/$rxcheck2{ $key }/g){  }

And that doesn't work.

So 2 questions:

Easy: How do I handle large numbers of regexes in an easily editable way, so I can change and add them without just copying and pasting the line before?

Harder: How do I handle them using a hash (or array if I have, say, multiple pieces I want to include, like 1) part to search, 2) replacement 3) comment, 4) global/case insensitive modifiers), if that is in fact the easiest way to do this?

Thanks for your help -

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Hello World, Dec 20 '14 at 20:34

score 10 · Accepted Answer · edited Oct 28 '14 at 10:00

Problem #1

As there doesn't appear to be much structure shared by the individual regexes, there's not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that instead of having to say:

$text =~ s/foo/bar/g;

You can just say:

s/foo/bar/g;

A common idiom for doing this is to use a degenerate for() loop as a topicalizer:

for($text)
{
  s/foo/bar/g;
  s/qux/meh/g;
  ...
}

The scope of this block will preserve any preexisting value of $_, so there's no need to explicitly localize $_.
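
A minimal sketch of the topicalizer idiom, using made-up sample values for $text and $_ (neither comes from the question). It shows that the substitutions inside the block edit $text in place, and that the outer $_ is untouched afterwards:

```perl
use strict;
use warnings;

# Hypothetical sample values, chosen only for illustration.
my $text = "<p>Hello,   <b>world</b></p>";
$_ = "outer value";

# foreach aliases $_ to $text for the duration of the block,
# so every s/// below modifies $text directly.
for ($text) {
    s/<[^>]+>/ /g;    # replace each tag with a space
    s/\s+/ /g;        # collapse runs of whitespace
    s/^ | $//g;       # trim a single leading/trailing space
}

print "text:  $text\n";    # text:  Hello, world
print "topic: $_\n";       # topic: outer value
```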

At this point, you've eliminated almost every non-boilerplate character -- how much shorter can it get, even in theory?

Unless what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, count etc. all regexes.

Problem #2

You can use the qr// syntax to quote the "search" part of the substitution:

my $search = qr/(<[^>]+>)/;
$str =~ s/$search/foo,$1,bar/;

However I don't know of a way of quoting the "replacement" part adequately. I had hoped that qr// would work for this too, but it doesn't. There are two alternatives worth considering:

1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.

2. Use an array of anonymous subroutines:

my @replacements = (
    sub { $_[0] =~ s/<[^>]+>/ /g; },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/[\(\{\[]\d+[\(\{\[]/ /g; },
    sub { $_[0] =~ s/\s+[<>]+\s+/\. /g },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; },
    sub { $_[0] =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; }
);

# Assume your data is in $_
foreach my $repl (@replacements) {
    $repl->($_);
}

You could of course use a hash instead, keyed by something more descriptive, and/or make each array element (or hash value) a multivalued record that also carries a comment or other information.
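
As a rough sketch of the multivalued variant, each rule below is a hash with a comment, a compiled pattern and a replacement closure. The field names (note, re, with, hits), the helper apply_rules and the sample input are all made up for illustration; the first capture group is passed to the closure explicitly, so no string eval is needed:

```perl
use strict;
use warnings;

# An array keeps the rules in the order they are written.
# All field names here are hypothetical.
my @rules = (
    { note => 'strip tags',                  re => qr/<[^>]+>/, with => sub { ' ' } },
    { note => 'remove space before a quote', re => qr/(\w) "/,  with => sub { qq{$_[0]"} } },
    { note => 'collapse whitespace',         re => qr/\s+/,     with => sub { ' ' } },
);

sub apply_rules {
    my ($text) = @_;
    for my $rule (@rules) {
        # /e runs the closure once per match; $1 is handed in as $_[0].
        my $n = $text =~ s/$rule->{re}/$rule->{with}->($1)/ge;
        $rule->{hits} += $n || 0;    # s/// returns the number of substitutions
    }
    return $text;
}

my $clean = apply_rules(qq{word "<br>  more});
print "$clean\n";    # word" more

# Because the rules are data, they can be reported on afterwards.
printf "%-28s %d\n", $_->{note}, $_->{hits} for @rules;
```

Modifiers such as /i can live inside the qr// itself (for example qr/foo/i). This sketch always applies /g, so a per-rule "global" switch would need an extra field.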

@Chas: Definitely prettier in this case, but how are they safer? — j_random_hacker, May 09 '09 at 17:04
Hmm, I know /e is safer because it is more like eval {} than eval "", but /ee may not be safer, but I can't remember why. — Chas. Owens, May 09 '09 at 17:08
/e is just a string eval. /ee is the same thing, but you take the result of the first /e and do it again. There isn't a safety feature by adding or subtracting an /e. — brian d foy, May 09 '09 at 20:39
I really like John Siracusa's edit, suggesting using "for ($mystr) { ... }" as a way to "topicalise" -- neat! — j_random_hacker, May 12 '09 at 03:37

Chas. Owens · Answer 2 · 2009-05-09T17:20:34.900

4

Hashes are not a good fit because they are unordered. I find that an array of arrays, where each inner array holds a compiled regex and a string to eval (actually it is a double eval), works best:

#!/usr/bin/perl

use strict;
use warnings;

my @replace = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my $s = "foo bar baz foo bar baz";

for my $replace (@replace) {
    $s =~ s/$replace->[0]/$replace->[1]/gee;
}

print "$s\n";
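
To see what each e contributes, here is a small sketch using the first rule from the table above. The variable names $plain, $once and $twice are only for illustration:

```perl
use strict;
use warnings;

my $pattern     = qr/(bar)/;
my $replacement = '"<$1>"';    # single quotes: the literal text "<$1>"

# No /e: $replacement is interpolated once, and its text is inserted as-is.
(my $plain = "foo bar") =~ s/$pattern/$replacement/;

# One /e: the expression $replacement is evaluated, yielding the same text.
(my $once = "foo bar") =~ s/$pattern/$replacement/e;

# Two /e: that text is then evaluated as Perl code, so "<$1>" becomes <bar>.
(my $twice = "foo bar") =~ s/$pattern/$replacement/ee;

print "none: $plain\n";    # none: foo "<$1>"
print "e:    $once\n";     # e:    foo "<$1>"
print "ee:   $twice\n";    # ee:   foo <bar>
```

Because /ee is a string eval of whatever is stored in the table, it only makes sense when the replacement strings come from trusted code rather than from user input.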

I think j_random_hacker's second solution is vastly superior to mine. Individual subroutines give you the most flexibility and are an order of magnitude faster than my /ee solution:

bar <bar> baz bar <bar> baz
bar <bar> baz bar <bar> baz
         Rate refs subs
refs  10288/s   -- -91%
subs 111348/s 982%   --

Here is the code that produces those numbers:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

my @subs = (
    sub { $_[0] =~ s/(bar)/<$1>/g },
    sub { $_[0] =~ s/foo/bar/g },
);

my @refs = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my %subs = (
    subs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $sub (@subs) {
            $sub->($s);
        }
        return $s;
    },
    refs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $ref (@refs) {
            $s =~ s/$ref->[0]/$ref->[1]/gee;
        }
        return $s;
    }
);

for my $sub (keys %subs) {
    print $subs{$sub}(), "\n";
}

Benchmark::cmpthese -1, \%subs;

edited May 09 '09 at 17:20

answered May 09 '09 at 16:47

Chas. Owens


+1. Good point about hashes being unordered -- the order of applying search & replace operations can make a big difference. I'm confused why 2 "e" flags are needed -- wouldn't one be enough? Could you step me through it? – j_random_hacker May 09 '09 at 16:59
Due to a bug in the flag evaluation portion of regexes, people found that each extra e added another level of eval. This was found to be handy, so it got promoted to a feature. With /e the first replace becomes "<$1>", that is, you see "<$1>" in $s. The second e then evals "<$1>", producing the desired <bar> replacement. – Chas. Owens May 09 '09 at 17:06
You can use Tie::DxHash to maintain insertion order: http://search.cpan.org/~kruscoe/Tie-DxHash-1.05/lib/Tie/DxHash.pm – Drew Stephens May 09 '09 at 17:15
@dinomite Yes, but at the loss of performance with no real gain in readability. This isn't really a job for a hash (keys are not randomly accessed, there is no need for unique keys, the data is not unordered, etc). An array of coderefs seems to be the best solution. – Chas. Owens May 09 '09 at 17:24
@Chas: Thanks, but I'm wondering why you could/would not just say qr/(bar)/ => '<$1>' and then use a single /e. (I'm aware of /ee, /eee etc... so far I haven't found cause to use them but I'm on the lookout :)) – j_random_hacker May 09 '09 at 17:43
@j_random_hacker because /e is evaluating $ref->[1] not the contents of $ref->[1]. The double quoted string nature of the replace is removed when you say /e. – Chas. Owens May 09 '09 at 18:40
@Chas: I see, thanks. I guess I thought Perl would treat that $ref->[1] as an expression to be interpolated without needing any /e (i.e. in the same way that a plain mention of $foo would be interpolated without /e). Oh well, cryptic Perl parsing rules 1, j_random_hacker 0... – j_random_hacker May 09 '09 at 20:53
@j_random_hacker $ref->[1] is interpolated when there is no /e, but when /e is in effect there is no interpolation step. – Chas. Owens May 09 '09 at 21:01
@Chas: I think I've finally got it -- /e implies no interpolation (like single quotes). Thanks for your patience :) – j_random_hacker May 10 '09 at 09:10
@Chas.Owens: +1 for the very interesting (and quite generic) way to time and try different approaches. But in general, what is, for you, the **most efficient** way (and I mean, maybe not either of those 2, as they need to call subs, which I believe adds overhead?) to do **many** search/replaces in Perl? I'm writing a "colorizer" which looks for various simple-to-complex strings and adds ANSI color codes before and after each (or sometimes portions of them)... And it's sloooow when there are many search/replaces or when the files to colorize get close to several megabytes... – Olivier Dulac Apr 19 '13 at 17:21

score 4 · Answer 3 · answered May 09 '09 at 17:09

You say you are dealing with HTML. You are now realizing that this is pretty much a losing battle with fleeting and fragile solutions.

A proper HTML parser would make your life easier. HTML::Parser can be hard to use, but there are other very useful libraries on CPAN which I can recommend if you can specify what you are trying to do rather than how.

Sinan Ünür


Good point, I was answering the general question of how to run multiple regexes against a string in a maintainable way, but the specific question is about running a regex on HTML, which is a no-no. See http://stackoverflow.com/questions/701166http://stackoverflow.com/questions/701166 for why and http://stackoverflow.com/questions/773340http://stackoverflow.com/questions/773340 for examples on how to use HTML parsers. – Chas. Owens May 09 '09 at 17:31
That is weird, it double pasted the links, let me try again: http://stackoverflow.com/questions/701166 for why. – Chas. Owens May 09 '09 at 18:42
HTML::Parser is often too much work for the nastiness of some data sources. If you can do a bunch of quick substitutions to regularize the input, you can make things easier down the road. This isn't a question about parsing HTML, but cleaning up dirty data. – brian d foy May 09 '09 at 20:41
HTML::Parser is indeed too much work in most cases. However, there are many libraries that solve many a complicated problem. I have dealt with incredibly badly formed HTML in very large files thanks to such modules. If we knew what information Jeff is trying to get out of these files, a better alternative than a massive block of substitutions with no underlying theme might present itself. – Sinan Ünür May 10 '09 at 02:10
