
So my Perl script basically takes a string and then tries to clean it up by doing multiple search and replaces on it, like so:

$text =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/[\(\{\[]\d+[\(\{\[]/ /g;
$text =~ s/\s+[<>]+\s+/\. /g;
$text =~ s/\s+/ /g;
$text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; # replace . **** Begin or . #### Begin or ) *The 
$text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; # . (blah blah) S... => . S...

As you can see, I'm dealing with nasty html and have to beat it into submission.

I'm hoping there is a simpler, aesthetically appealing way to do this. I have about 50 lines that look just like what is above.

I have solved one version of this problem by using a hash where the key is the comment and the value is the regular expression, like so:

%rxcheck = (
    'time of day' => '\d+:\d+',
    'starts with capital letters then a capital word' => '^([A-Z]+\s)+[A-Z][a-z]',
    'ends with a single capital letter' => '\b[A-Z]\.',
);

And this is how I use it:

 foreach my $key (keys %rxcheck) {
if($snippet =~ /$rxcheck{ $key }/g){ blah blah  }
 }

The problem comes up when I try my hand at a hash where the key is the expression and the value is what I want to replace it with... and there is a $1 or $2 in it.

%rxcheck2 = (
    '(\w) \"' => '$1\"',
);

The above is to do this:

$snippet =~ s/(\w) \"/$1\"/g;

But I can't seem to pass the "$1" part into the regex literally (I think that's the right word... it seems the $1 is being interpreted even though I used ' marks.) So this results in:

if($snippet =~ /$key/$rxcheck2{ $key }/g){  }

And that doesn't work.

So 2 questions:

Easy: How do I handle a large number of regexes in an easily editable way, so I can change and add them without just cutting and pasting the line before?

Harder: How do I handle them using a hash (or array if I have, say, multiple pieces I want to include, like 1) part to search, 2) replacement 3) comment, 4) global/case insensitive modifiers), if that is in fact the easiest way to do this?

Thanks for your help -

brian d foy
Jeff
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Hello World Dec 20 '14 at 20:34

3 Answers


Problem #1

As there doesn't appear to be much structure shared by the individual regexes, there's not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that instead of having to say:

$text =~ s/foo/bar/g;

You can just say:

s/foo/bar/g;

A common idiom for doing this is to use a degenerate for() loop as a topicalizer:

for($text)
{
  s/foo/bar/g;
  s/qux/meh/g;
  ...
}

The scope of this block will preserve any preexisting value of $_, so there's no need to explicitly localize $_.
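A minimal sketch of the topicalizer idiom, to make the claim about $_ concrete. The sample string and the "outer value" assignment are made up for illustration; the point is only that the loop aliases $_ to $text and that the previous $_ is back once the block ends.

```perl
use strict;
use warnings;

$_ = "outer value";              # pretend some caller is already using $_
my $text = "<b>foo</b>   qux";   # illustrative input, not from the question

for ($text) {                    # $_ is an alias for $text inside the block
    s/<[^>]+>/ /g;               # replace tags with a space
    s/\s+/ /g;                   # collapse runs of whitespace
    s/^ | $//g;                  # trim a leading or trailing space
}

print "text: [$text]\n";         # the substitutions modified $text in place
print "\$_:   [$_]\n";           # the earlier value of $_ is untouched
```

Because $_ is an alias rather than a copy, nothing has to be assigned back to $text after the loop.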

At this point, you've eliminated almost every non-boilerplate character -- how much shorter can it get, even in theory?

Unless what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, count etc. all regexes.
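If reporting or counting is the goal, one possible shape is a table of rules that a loop walks over. This is only a sketch under assumptions: the @rules layout and the labels are hypothetical, and it covers only replacements that are fixed strings (no $1), which sidesteps the quoting issue discussed under Problem #2.

```perl
use strict;
use warnings;

# Hypothetical rule table: [ label, pattern, fixed replacement string ]
my @rules = (
    [ 'strip tags',          qr/<[^>]+>/, ' ' ],
    [ 'collapse whitespace', qr/\s+/,     ' ' ],
);

my $text = "<p>one</p>\n\n<p>two</p>";   # illustrative input
my %count;

for my $rule (@rules) {
    my ($label, $pattern, $replacement) = @$rule;
    # s///g returns the number of substitutions made, or false if none
    $count{$label} = ($text =~ s/$pattern/$replacement/g) || 0;
}

# Report how often each rule fired, in the order the rules were applied
printf "%-20s %d\n", $_->[0], $count{ $_->[0] } for @rules;
print "[$text]\n";
```

An array is used rather than a hash so that the rules run in a fixed, visible order.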

Problem #2

You can use the qr// syntax to quote the "search" part of the substitution:

my $search = qr/(<[^>]+>)/;
$str =~ s/$search/foo,$1,bar/;

However I don't know of a way of quoting the "replacement" part adequately. I had hoped that qr// would work for this too, but it doesn't. There are two alternatives worth considering:

1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.

2. Use an array of anonymous subroutines:

my @replacements = (
    sub { $_[0] =~ s/<[^>]+>/ /g; },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/[\(\{\[]\d+[\(\{\[]/ /g; },
    sub { $_[0] =~ s/\s+[<>]+\s+/\. /g },
    sub { $_[0] =~ s/\s+/ /g; },
    sub { $_[0] =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; },
    sub { $_[0] =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; }
);

# Assume your data is in $_
foreach my $repl (@replacements) {
    $repl->($_);
}

You could of course use a hash instead, with some more descriptive string as the key, and/or you could use multivalued array elements (or hash values) that carry comments or other information alongside each substitution.
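As a rough sketch of the multivalued variant, each rule below is a record holding a comment, a qr// pattern, and a closure that builds the replacement from the captures. The field names (comment, search, replace), the time-format rule, and the sample string are all assumptions made for illustration. Per-rule modifiers such as /i can live inside the qr//, while /g is applied at the s/// site.

```perl
use strict;
use warnings;

# Hypothetical record layout: comment, search pattern, replacement closure
my @rules = (
    {
        comment => 'remove the space before a closing quote',
        search  => qr/(\w) "/,
        replace => sub { my ($word_char) = @_; qq{$word_char"} },
    },
    {
        comment => 'write times as 9h30 instead of 9:30',
        search  => qr/(\d+):(\d+)/,
        replace => sub { my ($h, $m) = @_; "${h}h${m}" },
    },
);

my $snippet = '"Meet at 9:30 " he said';   # illustrative input

for my $rule (@rules) {
    my ($search, $replace) = @{$rule}{qw(search replace)};
    # A single /e runs the expression; the captures are passed explicitly,
    # so no string eval is involved.
    $snippet =~ s/$search/$replace->($1, $2)/ge;
}

print "$snippet\n";
```

Passing $1 and $2 as arguments keeps each closure independent of how it is called; a rule that needs more captures would need the call site extended accordingly.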

Olivier Dulac
j_random_hacker

Hashes are not a good fit because they are unordered. I find that an array of arrays, where each inner array holds a compiled regex and a string to eval (actually it is a double eval), works best:

#!/usr/bin/perl

use strict;
use warnings;

my @replace = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my $s = "foo bar baz foo bar baz";

for my $replace (@replace) {
    $s =~ s/$replace->[0]/$replace->[1]/gee;
}

print "$s\n";
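To see why two e flags are used here, the following sketch runs the same stored replacement with no /e, one /e, and /ee. The variable names are illustrative. Note that /ee is a string eval, so it is only reasonable when the replacement strings come from a trusted source.

```perl
use strict;
use warnings;

my $pattern     = qr/(bar)/;
my $replacement = '"<$1>"';   # single-quoted: holds the six characters "<$1>"

# No /e: the variable is interpolated, but its contents are not re-scanned
(my $none = "foo bar") =~ s/$pattern/$replacement/;

# One /e: the expression $replacement is evaluated, yielding the same string
(my $one = "foo bar") =~ s/$pattern/$replacement/e;

# Two e's: that string is evaluated again as Perl code, so $1 is filled in
(my $two = "foo bar") =~ s/$pattern/$replacement/ee;

print "no /e: $none\n";
print "/e:    $one\n";
print "/ee:   $two\n";
```

Only the /ee form substitutes the captured text; the other two leave the literal characters "<$1>" in the result.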

I think j_random_hacker's second solution is vastly superior to mine. Individual subroutines give you the most flexibility and are an order of magnitude faster than my /ee solution:

bar <bar> baz bar <bar> baz
bar <bar> baz bar <bar> baz
         Rate refs subs
refs  10288/s   -- -91%
subs 111348/s 982%   --

Here is the code that produces those numbers:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

my @subs = (
    sub { $_[0] =~ s/(bar)/<$1>/g },
    sub { $_[0] =~ s/foo/bar/g },
);

my @refs = (
    [ qr/(bar)/ => '"<$1>"' ],
    [ qr/foo/   => '"bar"'  ],
);

my %subs = (
    subs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $sub (@subs) {
            $sub->($s);
        }
        return $s;
    },
    refs => sub {
        my $s = "foo bar baz foo bar baz";
        for my $ref (@refs) {
            $s =~ s/$ref->[0]/$ref->[1]/gee;
        }
        return $s;
    }
);

for my $sub (keys %subs) {
    print $subs{$sub}(), "\n";
}

Benchmark::cmpthese -1, \%subs;
Chas. Owens
  • +1. Good point about hashes being unordered -- the order of applying search & replace operations can make a big difference. I'm confused why 2 "e" flags are needed -- wouldn't one be enough? Could you step me through it? – j_random_hacker May 09 '09 at 16:59
  • Due to a bug in the flag evaluation portion of regexes, people found that each extra e added another level of eval. This was found to be handy, so it got promoted to a feature. With /e the first replace becomes '<$1>', that is you see '<$1>' in $s. The second e then evals '<$1>', producing the desired '<bar>' replacement. – Chas. Owens May 09 '09 at 17:06
  • You can use Tie::DxHash to maintain insertion order: http://search.cpan.org/~kruscoe/Tie-DxHash-1.05/lib/Tie/DxHash.pm – Drew Stephens May 09 '09 at 17:15
  • @dinomite Yes, but at the loss of performance with no real gain in readability. This isn't really a job for a hash (keys are not randomly accessed, there is no need for unique keys, the data is not unordered, etc). An array of coderefs seems to be the best solution. – Chas. Owens May 09 '09 at 17:24
  • @Chas: Thanks, but I'm wondering why you could/would not just say qr/(bar)/ => '<$1>' and then use a single /e. (I'm aware of /ee, /eee etc... so far I haven't found cause to use them but I'm on the lookout :)) – j_random_hacker May 09 '09 at 17:43
  • @j_random_hacker because /e is evaluating $ref->[1], not the contents of $ref->[1]. The double-quoted-string nature of the replacement is removed when you say /e. – Chas. Owens May 09 '09 at 18:40
  • @Chas: I see, thanks. I guess I thought Perl would treat that $ref->[1] as an expression to be interpolated without needing any /e (i.e. in the same way that a plain mention of $foo would be interpolated without /e). Oh well, cryptic Perl parsing rules 1, j_random_hacker 0... – j_random_hacker May 09 '09 at 20:53
  • @j_random_hacker $ref->[1] is interpolated when there is no /e, but when /e is in effect there is no interpolation step. – Chas. Owens May 09 '09 at 21:01
  • @Chas: I think I've finally got it -- /e implies no interpolation (like single quotes). Thanks for your patience :) – j_random_hacker May 10 '09 at 09:10
  • @Chas.Owens: +1 for the very interesting (and quite generic) way to time and try different approaches. But in general, what is, for you, the **most efficient** way (and I mean, maybe not either of those 2, as they need to call subs, which I believe adds overhead?) to do **many** search/replace operations in Perl? I'm writing a "colorizer" which looks for various simple-to-complex strings and adds ANSI color codes before and after each (or sometimes portions of them)... And it's sloooow when there are many search/replace operations or when the files to colorize get close to several megabytes... – Olivier Dulac Apr 19 '13 at 17:21

You say you are dealing with HTML. You are now realizing that this is pretty much a losing battle with fleeting and fragile solutions.

A proper HTML parser would make your life easier. HTML::Parser can be hard to use, but there are other very useful libraries on CPAN which I can recommend if you can specify what you are trying to do rather than how.

Sinan Ünür
  • Good point, I was answering the general question of how to run multiple regexes against a string in a maintainable way, but the specific question is about running a regex on HTML, which is a no-no. See http://stackoverflow.com/questions/701166http://stackoverflow.com/questions/701166 for why and http://stackoverflow.com/questions/773340http://stackoverflow.com/questions/773340 for examples on how to use HTML parsers. – Chas. Owens May 09 '09 at 17:31
  • That is weird, it double pasted the links, let me try again: http://stackoverflow.com/questions/701166 for why. – Chas. Owens May 09 '09 at 18:42
  • HTML::Parser is often too much work for the nastiness of some data sources. If you can do a bunch of quick substitutions to regularize the input, you can make things easier down the road. This isn't a question about parsing HTML, but cleaning up dirty data. – brian d foy May 09 '09 at 20:41
  • HTML::Parser is indeed too much work in most cases. However, there are many libraries that solve many a complicated problem. I have dealt with incredibly badly formed HTML in very large files thanks to such modules. If we knew what information Jeff is trying to get out of these files, a better alternative than a massive block of substitutions with no underlying theme might present itself. – Sinan Ünür May 10 '09 at 02:10