HTML Treebuilder XPath to Extract Links

Question

I am writing a basic script which just extracts all the links from a web page. It is written in Perl and makes use of the WWW::Mechanize and HTML::TreeBuilder::XPath modules, both of which I have installed through CPAN.

I know it can easily be done using only WWW::Mechanize; however, I would like to learn to do it using XPath as well.

So, the script will parse the entire web page, check the href attribute of every anchor tag, extract the link, and print it to the console or write it to a file. Please note that in the script below I have not used `use strict`, since I am only writing this to understand the concept of using XPath to traverse the HTML tree.

Here is the script:

#! /usr/bin/perl

use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use warnings;

$url="https://example.com";

$mech=WWW::Mechanize->new();
$mech->get($url);

$tree=HTML::TreeBuilder::XPath->new();

$tree->parse($mech->content);

$nodes=$tree->findnodes(q{'//a'}); # line is modified later.

foreach $node($nodes)
{
    print $node->attr('href');
}

And it gives an error:

Can't locate object method "attr" via package "XML::XPathEngine::Literal" at pagegetter.pl line 23.

I have modified the script as follows:

$nodes=$tree->findnodes(q{'//a/@href'});

while($node=$nodes->shift)
{
  print $node->attr('href');
}

Error:

Can't locate object method "shift" via package "XML::XPathEngine::Literal"

I am not sure how to print the value of the href attribute.

Should $nodes hold the list of all the href attributes? I believe it does not store the values themselves but pointers to them?

I tried searching and reading examples, however I am not sure how to go about it.

Thanks.

You should *always* `use strict`, no matter how trivial your program. It is arguably more important than the `use warnings` that you have chosen to use. – Borodin Jul 31 '12 at 13:18

score 4 · Accepted Answer · answered Jul 31 '12 at 13:07

There are a couple of mistakes. Repairs:

# list context: findnodes returns the matching nodes as a list
my @nodes = $tree->findnodes(
    q{//a}       # just a string, not a string containing quotes
);

# iterate over array
for my $node (@nodes) {
    print $node->attr('href'), "\n";
}

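To see why the two repairs matter, here is a minimal sketch that parses a hypothetical inline snippet (not the asker's page) and inspects what `findnodes` hands back in each case. The class names checked here are the ones the error messages in the question point to; treat the rest as an illustration rather than a statement of the module's full behaviour.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup, used instead of fetching a live page.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse('<html><body><a href="/x">x</a></body></html>');
$tree->eof;

# Quotes inside the expression make it an XPath string literal,
# so no nodes are selected at all.
my $literal = $tree->findnodes(q{'//a'});

# A bare path in scalar context gives a single node-set object.
my $nodeset = $tree->findnodes(q{//a});

# The same path in list context gives the element nodes themselves.
my @nodes = $tree->findnodes(q{//a});

print ref($literal), "\n";
print ref($nodeset), "\n";
print scalar(@nodes), " node(s) in list context\n";
```

Assigning to an array, as in the repair above, is what lets a plain `for` loop visit each anchor element.
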
daxim


You should use an XPath of `//a[@href]` to find all the anchor elements with a `href` attribute – Borodin Jul 31 '12 at 13:19
Thanks. Agreed, but what exactly do you print inside the for loop? And yes, I want to extract the links. – Neon Flash Jul 31 '12 at 13:23
@NeonFlash: The rest of your code remains as it is. Just `print $node->attr('href'), "\n"` – Borodin Jul 31 '12 at 13:30
@NeonFlash : Putting it all together: `print $_->attr( 'href' ), "\n" for $tree->findnodes( '//a[@href]' );` – Zaid Jul 31 '12 at 13:47
Thanks. Is there a way to extract the innerHTML between the anchor tags and print it? I tried `$node->text()` with the above script and it does not work. I did not want to start another question just for this, so I am asking here. – Neon Flash Jul 31 '12 at 14:45
I tried `@nodes = $tree->findnodes(q{//a/text()}); for $node (@nodes) { print $node . "\n"; }`. It prints a hash reference instead of the text between the anchor tags, for example `HTML::TreeBuilder::XPath::TextNode=HASH(0xa507f28)`. I believe these hash values are the memory addresses where the values are stored. Is there a way to print them? – Neon Flash Jul 31 '12 at 14:57
OK, I got the answer. We need to print it this way: `print $node->string_value . "\n";`. This works :) – Neon Flash Jul 31 '12 at 15:43
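
Putting the thread together, the following is a rough sketch of one way to collect both the link target and the link text for every anchor that has an `href`. The sample HTML and the `@links` array are hypothetical, and `as_text` (an HTML::Element method) is used here in place of the `string_value` call mentioned above, since the loop works on element nodes rather than text nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup, used instead of fetching a live page.
my $html = <<'HTML';
<html><body>
<a href="https://example.com/one">First link</a>
<a name="anchor-only">No href here</a>
<a href="/two">Second link</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# //a[@href] skips anchors without an href, so attr('href') is never undef.
my @links;
for my $node ($tree->findnodes('//a[@href]')) {
    push @links, [ $node->attr('href'), $node->as_text ];
}

# Print each pair as "href => text".
printf "%s => %s\n", @$_ for @links;

# Release the parse tree once it is no longer needed.
$tree->delete;
```

In the original script the HTML would come from `$mech->content` rather than a literal string; the XPath part stays the same.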
