I know this looks like flame bait, but it is not. Hear me out. Because Stack Exchange prefers questions (and this is primarily an answer), let me ask: what is wrong with the following?
Regexes are not suitable for DOM parsing. Yet, when they are usable, they generalize beautifully, are easy to use, and carry no additional learning curve compared to complex DOM parsers.
So I thought I would share a hack that makes regexes suitable for quick-and-dirty DOM changes. It adds temporary gids to HTML tags and their closing tags, and then uses regex backreferences to pair each opening tag with its own closing tag.
The code is mercifully short, and fits one of the Perl philosophies: some solutions are better if they are shorter on programming time than on run time.
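To see why the gids matter, here is a minimal sketch on a hand-tagged string. The input and the variable names are made up for illustration; the tag format is the one the hack produces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-tagged input: each opening tag and its closing tag share one gid.
my $html = '<sup gid="1">a<sup gid="2">b</sup gid="2">c</sup gid="1">';

# Without gids, a non-greedy match stops at the first </sup>,
# which belongs to the inner element, not the outer one.
(my $plain = $html) =~ s/ gid="[0-9]+"//g;
my ($naive) = $plain =~ m{<sup>(.*?)</sup>}s;

# With gids, \g1 forces the closing tag to carry the same gid
# as the opening tag, so the outer element is matched as a whole.
my ($gid, $body) = $html =~ m{<sup (gid="[0-9]+")>(.*?)</sup \g1>}s;

print "naive: $naive\n";
print "gid:   $body\n";
```

The naive match captures only `a<sup>b`, while the gid match captures the full body of the outer `sup`.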
#!/usr/bin/perl -w
use strict;
use warnings FATAL => qw{ uninitialized };
################################################################
sub tokenize {
my $GLOBAL; (defined($_[0])) and do { $GLOBAL = $_; $_ = $_[0]; };
my @xmlstack;
my $gid=0;
my $addgid= sub {
my ($s,$e) = @_;
($e =~ /\/$/) and return "<$s$e>"; ## self-closing tag; $e already ends in "/". could add gid here, too.
if ($s =~ /^\/(.*)/) {
my $off= pop(@xmlstack);
($off->[0] eq $1) or die "not a valid document at id=$gid. (wanted = $off->[0] . had = $s).\n";
return "<$s gid=\"".($off->[1])."\">"; ## not legal html now, but easy to remove
} else {
push(@xmlstack, [$s, ++$gid]);
return "<$s$e gid=\"$gid\">"; ## keep the attributes in $e; gid goes last so untokenize can strip it
}
};
my $U="#!#";
(/$U/) and die "sorry, this is a hack. pick any other unique string than '$U'\n";
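my @comments; # stash comment bodies, because comments can contain html tags
s/<!--(.*?)-->/do { push(@comments, $1); $U . $#comments . $U }/gmse;
s/\<(\/?[a-zA-Z0-9]*)(.*?)\>/$addgid->($1,$2)/emsg;
s/$U([0-9]+)$U/"<!--" . $comments[$1] . "-->"/gmse; # put the comments back untouched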
(@xmlstack) and die "unfinished business: ".pop(@xmlstack)->[0]."\n";
if (defined $GLOBAL) { my $CHANGED= $_; $_ = $GLOBAL; return $CHANGED; } else { return $_; }
}
sub untokenize {
my $GLOBAL; (defined($_[0])) and do { $GLOBAL = $_; $_ = $_[0]; };
s/ gid="[0-9]+">/>/g; ## buglet: could mistakenly remove gid from inside comments.
if (defined $GLOBAL) { my $CHANGED= $_; $_ = $GLOBAL; return $CHANGED; } else { return $_; }
}
################################################################
$_ = "<html>\n<body>\n
<p> <sup>a</sup><sub>b</sub>. </p>.
<hr />
<p> hi<sup>u<sub>ud<sup>udu</sup></sub></sup> </p>.
</body>
</html>
";
tokenize();
## now we can use regex using backreferences
while (/<sup (gid="[0-9]+")>(.*)<\/sup \g1>/gms) {
print "Example matching of all sup's: $1 $2\n"; ## could call recursively
}
## another example: add a class to sup that is immediately followed by a sub
s/\<sup (gid="[0-9]+")\>(.*)<\/sup \g1>\s*\<sub/<sup class="followed" $1>$2<\/sup $1><sub/gms;
print untokenize($_);
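The "could call recursively" remark in the matching loop could look roughly like this. `sup_bodies` is a hypothetical helper, not part of the script, and the input is hand-tagged in the format that `tokenize` emits so that the sketch runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the body of every <sup>, including nested
# ones, from text that already carries gids.
sub sup_bodies {
    my ($text) = @_;
    my @found;
    while ($text =~ m{<sup (gid="[0-9]+")>(.*?)</sup \g1>}gs) {
        my $body = $2;                            # copy before recursing
        push @found, $body, sup_bodies($body);    # outer body first, then nested ones
    }
    return @found;
}

my $tagged = '<p gid="1">hi<sup gid="2">u<sub gid="3">ud'
           . '<sup gid="4">udu</sup gid="4"></sub gid="3"></sup gid="2"></p gid="1">';

print "$_\n" for sup_bodies($tagged);
```

Each recursion level works on its own copy of the text, so the `/g` match position of the outer loop is not disturbed by the inner matches.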
This hack is probably still ignorant of a whole slew of HTML complications (for example, unclosed void tags such as <br>, or a ">" inside an attribute value), but it can handle a lot of XHTML and XML DOM jobs that are otherwise not suitable for regex parsing.
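One such complication is a void tag written without a trailing slash, such as `<br>`: the tokenizer pushes it on its stack and then dies at the next closing tag. A possible workaround is a pre-pass that rewrites void tags into self-closing form before tokenizing. This is only a sketch: `close_void_tags` is a hypothetical helper, and it assumes that no attribute value contains a `>`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The HTML void elements, which never have a closing tag.
my @VOID = qw(area base br col embed hr img input link meta source track wbr);
my $void_re = join '|', @VOID;

# Hypothetical pre-pass: turn <br> or <img src="x"> into <br /> or
# <img src="x" />, and normalize tags that are already self-closing.
sub close_void_tags {
    my ($html) = @_;
    $html =~ s{<($void_re)\b([^>]*?)\s*/?>}{<$1$2 />}gi;
    return $html;
}

print close_void_tags('<p>a<br>b<hr /><img src="a/b"></p>'), "\n";
```

The `\b` after the tag name keeps `<colgroup>` from being mistaken for `<col>`.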