What are some good ways to parse HTML and CSS in Perl?

Question

I have a project where my input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee deep into silly decisions I'll likely regret, I wanted to ask here: what do you guys use for this kind of task?

The old XML and the new HTML input files are structured similarly and hold the same information. The HTML uses divs in place of the XML's text elements, and keeps its style information in style tags and style attributes instead of separate XML attributes.

An example of the old XML is:

<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
      h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
      o_size="11.04" o_cs="4.6">
Some text
</text>

An example of the new HTML is:

<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
  <span class="ft19" >
    Some text
  </span></nobr>
</div>

where "ft19" refers to a CSS rule defined at the top of the page, of the form:

.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
       font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
       x-pdf-letter-spacing:0.83px;}

Basically, all I want is a parser that can read the stylistic elements of each node as attributes, so I could do something like:

my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];
print "node's bold value is: " . $text_node->getAttribute('bold');

as I'm able to do with the XML. Does anything like that exist for parsing HTML? I'd really like to make sure I start this the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was another module that was way better for what I'm trying to do.

Ideas?

Since I don’t have time to write up a real answer for you, I’ll just comment with a link to something I did awhile back that should address all your needs but you’ll have to dig into it a bit yourself: [Move your CSS from stylesheets to inline with Perl](http://sedition.com/a/156). — Ashley, Feb 17 '11 at 21:03

score 3 · Accepted Answer · edited May 23 '17 at 10:24

The basic one I am aware of is HTML::Parser.

There is also Marpa::HTML, a project that works with it. It is part of the larger Marpa parser project, which can parse any language that can be described in BNF. Marpa is documented on the author's blog, which is very interesting, but the project is much newer and still experimental.

I also see that the wildly successful WWW::Mechanize uses HTML::TokeParser, which in turn uses HTML::PullParser, so there's that too.

If you need something even more generic (and evil), you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, though I'm not sure about tag properties) or even Regexp::Grammars. Again, this means reinventing the wheel somewhat, so I would only choose these routes if the above don't do what you need.
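To give a feel for the hand-rolled route, here is a minimal sketch in core Perl that turns CSS declarations (an inline style string, or the body of a simple ".class { ... }" rule like the one in the question) into hashes. The helper names parse_declarations and parse_class_rules are made up for this illustration. It is deliberately naive: it assumes no comments, no semicolons inside values, and only plain class selectors, so treat it as a starting point and not as a real CSS parser.

```perl
use strict;
use warnings;

# Hypothetical helper: split "prop:value;prop:value" into a hash reference.
# Naive by design: assumes values contain no semicolons and there are no comments.
sub parse_declarations {
    my ($text) = @_;
    my %decl;
    for my $pair (split /;/, $text) {
        my ($prop, $value) = split /:/, $pair, 2;
        next unless defined $value;          # skip blank or malformed pieces
        s/^\s+|\s+$//g for $prop, $value;    # trim surrounding whitespace
        next unless length $prop;
        $decl{ lc $prop } = $value;
    }
    return \%decl;
}

# Hypothetical helper: collect simple ".class { ... }" rules from a stylesheet.
sub parse_class_rules {
    my ($css) = @_;
    my %rules;
    while ($css =~ /\.([\w-]+)\s*\{([^}]*)\}/g) {
        $rules{$1} = parse_declarations($2);
    }
    return \%rules;
}

# Sample input copied from the question.
my $css = <<'CSS';
.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
       font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
       x-pdf-letter-spacing:0.83px;}
CSS

my $rules  = parse_class_rules($css);
my $inline = parse_declarations('position:absolute;top:145;left:89;x-pdf-top:744;');

print "ft19 font-style: $rules->{ft19}{'font-style'}\n";
print "inline left: $inline->{left}\n";
```

With the style data in hashes, looking up a node's "attribute" becomes a hash lookup keyed by the class name or taken from the inline style, which is close to the getAttribute call in the question.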

Perhaps I haven't helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than others.

Edit: one more parser for you that seems like it might do what you need: HTML::Tree. Look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
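For what it's worth, here is a rough sketch of what that could look like. It assumes HTML::TreeBuilder (from the HTML::Tree distribution on CPAN) is installed. The sample markup is a trimmed-down version of the HTML in the question, and the @rows variable is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # part of the HTML::Tree distribution (CPAN, not core)

# Trimmed-down sample based on the question's HTML (an assumption, not real input).
my $html = <<'HTML';
<html><head><style>.ft19{font-size:14px;font-style:italic;}</style></head>
<body>
<div o="9ka" style="position:absolute;top:145;left:89;">
  <span class="ft19">Some text</span>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down takes attribute => value criteria; _tag matches the element name.
my @spans = $tree->look_down(_tag => 'span', class => 'ft19');

my @rows;
for my $span (@spans) {
    my $div = $span->look_up(_tag => 'div');    # nearest enclosing div, if any
    push @rows, {
        text  => $span->as_trimmed_text,
        class => $span->attr('class'),
        style => $div ? $div->attr('style') : '',
    };
}

print "text=$_->{text} class=$_->{class} style=$_->{style}\n" for @rows;

$tree->delete;    # release the tree once the data has been copied out
```

Note that this only gets you the class name and the raw style string. Mapping "ft19" back to its font-style and friends still means parsing the stylesheet text separately.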

Thanks! Will check all of these out. – Eli Feb 17 '11 at 22:46

score 0 · Answer 2 · answered Feb 17 '11 at 21:12

It's not clear - is the Perl parsing for the purposes of doing the conversion to HTML (with embedded CSS)? If so, why not forget Perl and use XSLT which is designed to transform XML documents?

Derek Prior


No, I'm not doing any conversion to HTML. My project takes input files and does something with them (doesn't matter what). My input files used to be XML, which were easy for me to parse and process. They've now been switched to HTML of the form defined in the question, so I have to change how I parse them. I'm asking about good ways to go about doing that. – Eli Feb 17 '11 at 21:32
