I'm trying to parse the following HTML structure in Perl. I need to select all of the dd elements that have the class message and also an id. The script only needs to loop through those dd elements and print the id of each one, but it must ignore the first dd element, which is static and will not change.

Any Perl module is fine as long as it can be installed from CPAN. I don't have much experience with Perl or with parsing HTML, so any pointers would be very helpful.

Thanks :)

HTML Structure:

<pre><code>
<html>
<head>
</head>
<body>
 .....other elements
    <div id="messages">
        <div class="header"></div>
        <dl>
            <dd class="message unread mc-friend mc-message">This is just a random message, do not parse</dd>
            <dd id="msg2" class="message unread mc-message">
                Hello
            </dd>
            <dd id="msg3" class="message unread mc-message">
                Hello
            </dd>
        </dl>
    </div>
</body>
</html>
</code></pre>
Erik
Jack
    :) in general HTML::Parser is great, but you may have specific needs that point you somewhere else... there is also a goodly archive of similar questions here that may give you some useful tips. – Ether Jan 04 '11 at 21:20

2 Answers

23

Something like this, quick and easy:

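#!/usr/bin/perl
use strict;
use warnings;

use Mojo::DOM;

my $html = "Your HTML goes here";

# Parse the HTML into a DOM tree.
my $dom = Mojo::DOM->new;
$dom->parse($html);

# Walk every dd whose class attribute contains "message".
# $skip is false only on the first match, so the static first dd is ignored.
my $skip;
for my $dd ($dom->find('dd[class*="message"]')->each) {
    # attr('id') reads the id attribute (older Mojolicious releases spelled this attrs).
    print $dd->attr('id'), "\n" if $skip++;
}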
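
A variation on the loop above, as a minimal sketch rather than a definitive version: assuming the static first dd never carries an id (as in the sample HTML in the question), the selector dd.message[id] matches only the dd elements that have both the class and an id, so no skip counter is needed. The helper name mojo_message_ids and the inline sample markup are illustrative assumptions, not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Mojo::DOM;

# Hypothetical helper: returns the id of every <dd> that has the
# class "message" and also carries an id attribute.
sub mojo_message_ids {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    return map { $_->attr('id') } $dom->find('dd.message[id]')->each;
}

# Cut-down sample modelled on the structure in the question.
my $html = <<'HTML';
<dl>
  <dd class="message unread mc-friend mc-message">static</dd>
  <dd id="msg2" class="message unread mc-message">Hello</dd>
  <dd id="msg3" class="message unread mc-message">Hello</dd>
</dl>
HTML

print "$_\n" for mojo_message_ids($html);
```

If the first dd might gain an id later, the $skip approach is the safer of the two, since it depends on position rather than on the attribute being absent.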
Grrrr

Have a look at HTML::Parser or, better yet, HTML::TreeBuilder.

More on TreeBuilder.
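
For illustration, here is a rough sketch of how the HTML::TreeBuilder route might look. It relies on look_down from HTML::Element, with a regex criterion for the class and a code-ref criterion for the id. The helper name tree_message_ids and the sample markup are assumptions made for this example, and it assumes the static first dd has no id.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper: collects the ids of all <dd> elements whose class
# list includes "message" and which have an id attribute.
sub tree_message_ids {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @dds = $tree->look_down(
        _tag  => 'dd',
        class => qr/(?:^|\s)message(?:\s|$)/,      # "message" as a whole class token
        sub { defined $_[0]->attr('id') },          # must carry an id
    );

    my @ids = map { $_->attr('id') } @dds;
    $tree->delete;    # free the tree once we are done with it
    return @ids;
}

# Cut-down sample modelled on the structure in the question.
my $html = <<'HTML';
<dl>
  <dd class="message unread mc-friend mc-message">static</dd>
  <dd id="msg2" class="message unread mc-message">Hello</dd>
  <dd id="msg3" class="message unread mc-message">Hello</dd>
</dl>
HTML

print "$_\n" for tree_message_ids($html);
```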

Sinan Ünür
Dan McGrath
    I'll toss in XML::LibXML with XPath selectors, but I do prefer the CSS Selectors of Web::Query and Mojo::DOM. – Dave Jacoby Jul 25 '15 at 18:09