MSDN is a huge hierarchical doc site.

To be more precise, the content is organized hierarchically, but the URLs are not: the URL space is flat, making it look like everything sits in the same directory. (In reality there probably isn't a directory at all; the pages presumably come out of some database, but that's not relevant here.)

So if you want to download part of MSDN, say the NMake manual, you can't just recursively download everything below a given directory, because that would be all of MSDN: far too much for your hard drive and bandwidth.

But you could write a script that looks at the DOM (HTML) and then follows and downloads only those links contained in certain navigational sections of the document, such as the elements with CSS class toc_children and toc_siblings, but not toc_parent.

What you'd need would be some downloader that allows you to say:

$webclient->add_links( $xpath_expression ); # or
$webclient->add_links( $css_selector );

It shouldn't be too difficult to cobble something together using Perl, LWP and XML::LibXML (as an HTML parser), but maybe you know of a tool that allows you to do just that, so I don't need to reinvent it.
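
Roughly what I have in mind, as an untested sketch (the start URL is just a placeholder; the class names are the ones mentioned above):

#!/usr/bin/env perl

use strict;
use warnings;

use LWP::UserAgent;
use XML::LibXML;
use URI;

# Placeholder start page; in reality this would be the NMake manual's TOC page.
my $start = 'https://msdn.microsoft.com/library/some-page.aspx';

my $ua  = LWP::UserAgent->new;
my $res = $ua->get($start);
die $res->status_line unless $res->is_success;

# XML::LibXML can parse real-world HTML when recover is enabled.
my $dom = XML::LibXML->load_html(
    string  => $res->decoded_content,
    recover => 1,
);

# Follow only the links inside the toc_children/toc_siblings sections.
my @hrefs =
    map  { URI->new_abs( $_->value, $start )->as_string }
    $dom->findnodes(
        q{//*[contains(@class,'toc_children') or contains(@class,'toc_siblings')]//a/@href}
    );

print "$_\n" for @hrefs;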

It doesn't have to be Perl, any other language is fine, too, and so is a ready-made program that has the flexibility required for this job.

Lumi
  • You seem to have forgotten to ask a question. – ikegami Apr 09 '12 at 18:33
  • @ikegami - Being precise, or dense, eh? I wrote "maybe you know of a tool that allows you to do just that", but I admit I forgot the question mark. – Lumi Apr 09 '12 at 19:10
  • No, there is no existing tool that matches your extremely precise custom requirements, and you know that. But yeah, I am being dense. I am intentionally ignoring the only question I do hear ("Can you write my code for me?") for your benefit. – ikegami Apr 10 '12 at 04:10
  • @ikegami - "I am intentionally ignoring the only question I do hear ('Can you write my code for me?') for your benefit." - Consider that the questions we *hear* (but which aren't there) might be a consequence of our outlook on life. But then, everyone can just have a bad day or a bad week. Cheers. – Lumi Apr 10 '12 at 13:59

3 Answers


Check out the find_link function (and siblings) from WWW::Mechanize. It can use arbitrary criteria to find links including the "id" and "class" attributes.
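
For example, something along these lines (untested; the URL is a placeholder, and the class criterion matches the class attribute on the link tags themselves):

use strict;
use warnings;

use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('https://msdn.microsoft.com/library/some-page.aspx');   # placeholder URL

# find_all_links accepts the same criteria as find_link,
# including class and class_regex.
my @links = $mech->find_all_links( class_regex => qr/^toc_(?:children|siblings)$/ );

for my $link (@links) {
    print $link->url_abs, "\n";   # WWW::Mechanize::Link objects know their absolute URL
}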

benrifkah
  • I realize that this doesn't use XPath or CSS selectors but you may not need them to get the job done. – benrifkah Apr 09 '12 at 16:22
  • Thanks, `WWW::Mechanize` was the first thing I looked at. Unfortunately, its link spec does not take the position of the link in the DOM into account; it only looks at the link tag and doesn't have any information concerning the tag's place in the doc. It uses [HTML::Parser](https://metacpan.org/module/HTML::Parser), which doesn't build a DOM, so the info I need is not there. Thanks anyway. – Lumi Apr 09 '12 at 17:08
  • I considered that but it wasn't 100% clear to me without a sample of the HTML that you needed to restrict the link search to subsets of the DOM. Check out the [script Adam Gotch made](http://perlbuzz.com/2011/11/finding-a-lost-dogs-owner-with-perl-and-wwwmechanize.html) to combine WWW::Mechanize with [HTML::TreeBuilder::XPath](https://metacpan.org/module/HTML::TreeBuilder::XPath) – benrifkah Apr 09 '12 at 17:31
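
A rough, untested sketch of the combination described in that last comment (the URL is a placeholder; the class names come from the question):

use strict;
use warnings;

use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use URI;

my $mech = WWW::Mechanize->new;
$mech->get('https://msdn.microsoft.com/library/some-page.aspx');   # placeholder URL

# Build a real DOM from the fetched page so the link search can be
# restricted to the navigational sections.
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content );

my @anchors = $tree->findnodes(
    q{//*[contains(@class,'toc_children') or contains(@class,'toc_siblings')]//a}
);

for my $a (@anchors) {
    my $href = $a->attr('href') or next;
    my $abs  = URI->new_abs( $href, $mech->base );
    print "$abs\n";                 # or fetch it with $mech->get($abs) and save the content
}

$tree->delete;                      # HTML::Element trees want explicit cleanup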

Mojo::UserAgent returns stuff that understands CSS3 selectors or XPath. For instance, I just showed an example in Painless RSS processing with Mojo. I'm really enjoying this new(ish) web client stuff. Most everything I want is already there (no additional modules) and it's integrated very well.
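
A quick, untested sketch of what that looks like (the URL is a placeholder; the selectors reuse the class names from the question):

use strict;
use warnings;

use Mojo::UserAgent;

my $ua  = Mojo::UserAgent->new;
my $url = 'https://msdn.microsoft.com/library/some-page.aspx';   # placeholder

# The response body comes back as a Mojo::DOM object that takes CSS selectors.
my $dom = $ua->get($url)->res->dom;

# Only the links inside the toc_children/toc_siblings sections.
$dom->find('.toc_children a, .toc_siblings a')->each(sub {
    my $e = shift;
    print $e->attr('href'), "\n";
});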

brian d foy
  • Thanks - hadn't thought of using `Mojo::UserAgent` in a standalone fashion. But yes, why not? – Lumi Apr 10 '12 at 14:01

This might get you started in the right direction or lead you astray. Note that I first saved the page to a local file so as not to constantly download it while I was working on it.

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

$tree->parse_file('nmake-ref.html');

# Collect ( link text => href ) pairs for the links inside the
# "sectionblock" navigation divs.
my @links = map { { $_->as_text => $_->attr('href') } }
            $tree->findnodes(q{//div[@class='sectionblock']/*/a});

for my $link (@links) {
    my ($entry, $url) = %{ $link };

    # Derive a safe local file name from the link text.
    ($link->{file} = "$entry.html") =~ s/[^A-Za-z_0-9.]+/_/g;

    # List form of system bypasses the shell, so pass the URL unquoted.
    system 'wget', $url, '-O', $link->{file};
}
Sinan Ünür