Regex Parsing for XML or XHTML DOM in Perl

Question

I know this looks like flame bait, but it is not. Hear me out. Because Stack Exchange prefers questions (and this is primarily an answer), let me ask: "what is wrong with the following?"

Regexes are not suitable for DOM parsing. Yet, when usable, they generalize beautifully, are easy to use, and add no learning curve, unlike complex DOM parsers.

So, I thought I would share a hack that makes regexes suitable for quick-and-dirty DOM changes. It adds a temporary gid attribute to each opening HTML tag and to its matching closing tag, and then uses regex backreferences to pair them up.

The code is mercifully short, and fits the Perl philosophy that some solutions are better when they save programming time rather than run time.

#!/usr/bin/perl -w
use strict;
use warnings FATAL => qw{ uninitialized };

################################################################
sub tokenize {
  my $GLOBAL; (defined($_[0])) and do { $GLOBAL = $_; $_ = $_[0]; };

  my @xmlstack;
  my $gid=0;

  my $addgid= sub {
    my ($s,$e) = @_;

    ($e =~ /\/$/) and return "<$s$e>";  ## self-closing tag: $e already ends in "/". could add gid here, too.

    if ($s =~ /^\/(.*)/) {
      my $off= pop(@xmlstack);
      ($off->[0] eq $1) or die "not a valid document at id=$gid. (wanted = $off->[0] . had = $s).\n";
      return "<$s gid=\"".($off->[1])."\">"; ## not legal html now, but easy to remove
    } else {
      push(@xmlstack, [$s, ++$gid]);
      return "<$s$e gid=\"$gid\">";  ## keep the tag's own attributes; gid goes last so untokenize can strip it
    }
  };

  my $U="#!#";
  (/$U/) and die "sorry, this is a hack.  pick any other unique string than '$U'\n";
  s/<!--(.*?)-->/$U$1$U/gms;  # comments can contain html tags
  s/\<(\/?[a-zA-Z0-9]*)(.*?)\>/$addgid->($1,$2)/emsg;
  s/$U(.*?)$U/<!--$1-->/gms;
  (@xmlstack) and die "unfinished business: ".pop(@xmlstack)->[0]."\n";

  if ($GLOBAL) { my $CHANGED= $_; $_ = $GLOBAL; return $CHANGED; } else { return $_; }
}

sub untokenize { 
  my $GLOBAL; (defined($_[0])) and do { $GLOBAL = $_; $_ = $_[0]; };
  s/ gid="[0-9]+">/>/g; ## buglet: could mistakenly remove gid from inside comments.
  if ($GLOBAL) { my $CHANGED= $_; $_ = $GLOBAL; return $CHANGED; } else { return $_; }
}

################################################################


$_ = "<html>\n<body>\n
<p> <sup>a</sup><sub>b</sub>. </p>.
<hr />
<p> hi<sup>u<sub>ud<sup>udu</sup></sub></sup> </p>.
</body>
</html>
";


tokenize();

## now we can use regex using backreferences
while (/<sup (gid="[0-9]+")>(.*)<\/sup \g1>/gms) {
  print "Example matching of all sup's:  $1 $2\n";  ## could call recursively
}

## another example:  add a class to sup that is immediately followed by a sub
s/\<sup (gid="[0-9]+")\>(.*)<\/sup \g1>\s*\<sub/<sup class="followed" $1>$2<\/sup $1><sub/gms;

print untokenize($_);

This is probably still ignorant of a whole slew of HTML complications, but it can handle many XHTML and XML DOM jobs that are otherwise not suited to regex parsing.
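
To make "complications" concrete, here is a minimal sketch of two inputs on which the tag pattern from tokenize() captures something other than a clean tag name plus attributes. The sample strings and the names `$tag_re`, `@samples`, and `@results` are assumptions made up for this illustration; only the pattern itself is taken from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The tag pattern used by tokenize() above, copied unchanged.
my $tag_re = qr/\<(\/?[a-zA-Z0-9]*)(.*?)\>/ms;

# Hypothetical inputs, chosen only for illustration.
my @samples = (
    '<a title="x > y">link</a>',   # '>' is legal inside an XML attribute value
    '<!DOCTYPE html>',             # a declaration, not an element
);

my @results;   # one [name, rest] pair per sample
for my $sample (@samples) {
    if ($sample =~ $tag_re) {
        push @results, [$1, $2];
        printf "input: %s\n  name=[%s] rest=[%s]\n", $sample, $1, $2;
    }
}
```

In the first sample the lazy `(.*?)` stops at the `>` inside the attribute value, so the tag is cut short. In the second, the captured name is empty, so tokenize() would push it onto the stack as an opening tag that is never closed.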

`/\<(\/?[a-zA-Z0-9]*)(.*?)\>/` isn't right. The reason to use existing tool isn't because regex are bad, it's because it's a hell of lot more work to write your own parser than to use an existing one (using regex or otherwise). — ikegami, Feb 18 '14 at 03:40
sample of where it can fail on xhtml? I am looking at the html tag list and this should be working on valid html...it is too permissive, of course. — ivo Welch, Feb 18 '14 at 15:09
I explicitly did not state to contradict your claim that it's easier to write your own parser than learning existing one. — ikegami, Feb 18 '14 at 15:11
"**Oh Yes You Can Use Regexes to Parse HTML!**": http://stackoverflow.com/a/4234491/716443 (*But it's a lot harder than most people expect, and probably harder than learning to use a well-documented and tested parser.*) This may also be of interest: http://stackoverflow.com/a/702222/716443 — DavidO, Feb 18 '14 at 16:34

score 0 · Accepted Answer · answered Feb 26 '14 at 06:36

The solution I posted is naive.

Plus:

On random pages, it seems to handle about 9 out of 10 XHTML web pages on the internet. It can handle ordinary XHTML files, but can fail on more unusual features (such as DTDs). If another program generated your XHTML output, it may work all the time.
The learning curve here is about 1/10 of that for real DOM parsing.
The code here is about 1/10 the size needed for real DOM parsing.
Familiar Perl regex knowledge can then be used.
Be prepared for this tool to be rather limited. If you outgrow its capabilities, you may have to learn a better DOM parser anyway.

Minus:

It is completely unsuitable if perfect DOM parsing is required. This code is breakable. It follows the Berkeley rather than the AT&T approach.
But perfect DOM parsers can also fail on bad HTML documents.
And if you already know DOM parsing, then there is little time cost to doing it right: use Mojolicious or XML::LibXML. You may as well stick to the better solution then.
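
For comparison, here is a rough sketch of the "add a class to a sup followed by a sub" task done with XML::LibXML. It assumes the module is installed from CPAN and that the document has no default namespace; real XHTML with an `xmlns` declaration would need XML::LibXML::XPathContext with a registered prefix. The sample document and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;   # CPAN module, not part of core Perl

# Hypothetical sample document, similar to the one in the question.
my $xml = <<'XML';
<html><body>
<p> <sup>a</sup><sub>b</sub>. </p>
<p> hi<sup>u<sub>ud<sup>udu</sup></sub></sup> </p>
</body></html>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Select every <sup> whose next sibling element is a <sub>.
# Unlike the regex version, this ignores any text between the two elements.
for my $sup ($doc->findnodes('//sup[following-sibling::*[1][self::sub]]')) {
    $sup->setAttribute(class => 'followed');
}

my $out = $doc->toString;
print $out;
```

Only the first `<sup>` gets the class here; the nested ones have no following sibling element.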

Giving this code a reflexive -1 vote ignores that it has its uses. Sometimes an ordinary screwdriver can do a job where a Phillips would be better. This code is an ordinary screwdriver for a Phillips screw. Stack Overflow is a site to which novices come in need of quick solutions, too, not just experts. This is why I posted it to begin with.

Simple improvements and fixes are appreciated, though the goal here is explicitly not to deal with all possible valid and invalid, sane and insane, correct and incorrect permutations of XML and XHTML.
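
One small fix in that spirit, as a sketch only: the untokenize buglet noted in the question (gid removal inside comments) can be avoided by splitting the text on comments first. The function name `untokenize_keep_comments` is an assumption, and unlike the original it takes its input as an argument and never touches `$_`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical replacement for untokenize(); the name is an assumption.
sub untokenize_keep_comments {
    my ($text) = @_;
    # A capturing group in split keeps each comment as its own list element.
    my @parts = split /(<!--.*?-->)/s, $text;
    for my $part (@parts) {
        next if $part =~ /^<!--/;        # leave comments untouched
        $part =~ s/ gid="[0-9]+">/>/g;   # same substitution as untokenize()
    }
    return join '', @parts;
}

my $tokenized = '<p gid="1">x</p gid="1"><!-- <b gid="7"> -->';
my $clean = untokenize_keep_comments($tokenized);
print "$clean\n";   # prints: <p>x</p><!-- <b gid="7"> -->
```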

/iaw
