HTML parsing in Perl

Question

I'm trying to parse the following HTML structure in Perl. I need to select all of the dd elements that have the class message and also an id. The script only needs to loop through those dd elements and print the id of each one, but it must ignore the first dd element, because that one is static and will not change.

Any Perl module is fine as long as it can be installed from CPAN. I don't have much experience with Perl or with parsing HTML, so any pointers would be very helpful.

Thanks :)

HTML Structure:

<pre><code>
<html>
<head>
</head>
<body>
 .....other elements
    <div id="messages">
        <div class="header"></div>
        <dl>
            <dd class="message unread mc-friend mc-message">This is just a random message, do not parse</dd>
            <dd id="msg2" class="message unread mc-message">
                Hello
            </dd>
            <dd id="msg3" class="message unread mc-message">
                Hello
            </dd>
        </dl>
    </div>
</body>
</html>
</code></pre>

:) In general, HTML::Parser is great, but you may have specific needs that point you somewhere else. There is also a good archive of similar questions here that may give you some useful tips. – Ether, Jan 04 '11 at 21:20

score 23 · Accepted Answer · answered Jan 04 '11 at 21:02

Something like this, quick and easy:

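#!/usr/bin/perl
use strict;
use warnings;

use Mojo::DOM;

my $html = "Your HTML goes here";

# Parse the document into a DOM tree.
my $dom = Mojo::DOM->new;
$dom->parse($html);

# Match every dd whose class attribute contains "message". $skip is false
# only on the first pass, so the first (static) dd is not printed.
my $skip;
for my $dd ($dom->find('dd[class*="message"]')->each) {
    # attr() is the accessor in current Mojolicious; older releases used attrs().
    print $dd->attr('id'), "\n" if $skip++;
}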
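
Since the question asks for dd elements that have both the class message and an id, the selector can express that directly, and the skip counter is then not needed. The following is only a sketch of that variation, not the answer's original code: it assumes a recent Mojolicious (where the accessor is attr), the helper name message_ids is made up for illustration, and the sample markup is a trimmed copy of the question's HTML with the closing tags corrected.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Mojo::DOM;

# Hypothetical helper: returns the id of every dd that carries the class
# "message" AND has an id attribute. The static first dd has no id, so the
# selector leaves it out without any counter.
sub message_ids {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    return map { $_->attr('id') } $dom->find('dd.message[id]')->each;
}

# Trimmed sample based on the question's markup (assumed well-formed here).
my $html = <<'HTML';
<div id="messages">
    <div class="header"></div>
    <dl>
        <dd class="message unread mc-friend mc-message">This is just a random message, do not parse</dd>
        <dd id="msg2" class="message unread mc-message">Hello</dd>
        <dd id="msg3" class="message unread mc-message">Hello</dd>
    </dl>
</div>
HTML

print "$_\n" for message_ids($html);
```

Note that dd.message matches the class as a whole word, whereas class*="message" also matches class names that merely contain the substring, such as mc-message.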

Grrrr

Perfect, Mojo::DOM is exactly what I want. :D – Jack Jan 04 '11 at 21:08

score 8 · Answer 2 · edited Jun 21 '12 at 13:52

Have a look at HTML::Parser or, better yet, HTML::TreeBuilder.
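
As a rough illustration of the HTML::TreeBuilder route (a sketch under assumptions, not code from this answer): look_down can filter on the tag name and on attribute values, so the static first dd can be excluded by requiring an id. The helper name message_ids_treebuilder is hypothetical, and the sample markup is a trimmed, well-formed copy of the question's HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper: collects the id of each dd that has the class
# "message" (as a whole word) and a non-empty id attribute.
sub message_ids_treebuilder {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # In list context look_down returns every element meeting all criteria.
    my @ids = map { $_->attr('id') } $tree->look_down(
        _tag  => 'dd',
        id    => qr/./,                          # id must be present
        class => qr/(?:^|\s)message(?:\s|$)/,    # class list includes "message"
    );

    $tree->delete;    # release the tree explicitly
    return @ids;
}

# Trimmed sample based on the question's markup (assumed well-formed here).
my $html = <<'HTML';
<div id="messages">
    <div class="header"></div>
    <dl>
        <dd class="message unread mc-friend mc-message">This is just a random message, do not parse</dd>
        <dd id="msg2" class="message unread mc-message">Hello</dd>
        <dd id="msg3" class="message unread mc-message">Hello</dd>
    </dl>
</div>
HTML

print "$_\n" for message_ids_treebuilder($html);
```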
