MSDN is a huge hierarchical doc site.

To be more precise, the content is organized hierarchically, but the URLs are not: the URL space is flat, making it look like everything sits in the same directory. (In reality there probably isn't a directory at all; the pages presumably come out of some database, but that's not relevant here.)

So if you want to download part of MSDN, say the NMake manual, you can't just recursively download everything below a given directory, because that would be all of MSDN: far too much for your hard drive and bandwidth.

But you could write a script that looks at the DOM (HTML) and then follows and downloads only those links contained in certain navigational sections of the document, such as the elements with CSS class toc_children and toc_siblings, but not toc_parent.

What you'd need would be some downloader that allows you to say:

$webclient->add_links( $xpath_expression ); # or
$webclient->add_links( $css_selector );

It shouldn't be too difficult to cobble something together using Perl, LWP and XML::LibXML (as an HTML parser), but maybe you know of a tool that allows you to do just that, so I don't need to reinvent it.
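
Roughly what I have in mind, as an untested sketch (the start URL is just a placeholder; the class names are the ones mentioned above):

#!/usr/bin/env perl

use strict;
use warnings;

use LWP::UserAgent;
use XML::LibXML;
use URI;

# Placeholder start page; in reality this would be the NMake manual's TOC page.
my $start = 'https://msdn.microsoft.com/library/some-page.aspx';

my $ua  = LWP::UserAgent->new;
my $res = $ua->get($start);
die $res->status_line unless $res->is_success;

# XML::LibXML can parse real-world HTML when recover is enabled.
my $dom = XML::LibXML->load_html(
    string  => $res->decoded_content,
    recover => 1,
);

# Follow only the links inside the toc_children/toc_siblings sections.
my @hrefs =
    map  { URI->new_abs( $_->value, $start )->as_string }
    $dom->findnodes(
        q{//*[contains(@class,'toc_children') or contains(@class,'toc_siblings')]//a/@href}
    );

print "$_\n" for @hrefs;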

It doesn't have to be Perl, any other language is fine, too, and so is a ready-made program that has the flexibility required for this job.

Lumi
  • You seem to have forgotten to ask a question. – ikegami Apr 09 '12 at 18:33
  • @ikegami - Being precise, or dense, eh? I wrote "maybe you know of a tool that allows you to do just that", but I admit I forgot the question mark. – Lumi Apr 09 '12 at 19:10
  • No, there is no existing tool that matches your extremely precise custom requirements, and you know that. But yeah, I am being dense. I am intentionally ignoring the only question I do hear ("Can you write my code for me?") for your benefit. – ikegami Apr 10 '12 at 04:10
  • @ikegami - "I am intentionally ignoring the only question I do hear ('Can you write my code for me?') for your benefit." - Consider that the questions we *hear* (but which aren't there) might be a consequence of our outlook on life. But then, everyone can just have a bad day or a bad week. Cheers. – Lumi Apr 10 '12 at 13:59

3 Answers


Check out the find_link function (and siblings) from WWW::Mechanize. It can use arbitrary criteria to find links including the "id" and "class" attributes.
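
For example, something along these lines (untested; the URL is a placeholder, and the class criterion matches the class attribute on the link tags themselves):

use strict;
use warnings;

use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('https://msdn.microsoft.com/library/some-page.aspx');   # placeholder URL

# find_all_links accepts the same criteria as find_link,
# including class and class_regex.
my @links = $mech->find_all_links( class_regex => qr/^toc_(?:children|siblings)$/ );

for my $link (@links) {
    print $link->url_abs, "\n";   # WWW::Mechanize::Link objects know their absolute URL
}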

benrifkah
  • I realize that this doesn't use XPath or CSS selectors but you may not need them to get the job done. – benrifkah Apr 09 '12 at 16:22
  • Thanks, `WWW::Mechanize` was the first thing I looked at. Unfortunately, its link spec does not take the position of the link in the DOM into account; it only looks at the link tag and doesn't have any information concerning the tag's place in the doc. It uses [HTML::Parser](https://metacpan.org/module/HTML::Parser), which doesn't build a DOM, so the info I need is not there. Thanks anyway. – Lumi Apr 09 '12 at 17:08
  • I considered that but it wasn't 100% clear to me without a sample of the HTML that you needed to restrict the link search to subsets of the DOM. Check out the [script Adam Gotch made](http://perlbuzz.com/2011/11/finding-a-lost-dogs-owner-with-perl-and-wwwmechanize.html) to combine WWW::Mechanize with [HTML::TreeBuilder::XPath](https://metacpan.org/module/HTML::TreeBuilder::XPath) – benrifkah Apr 09 '12 at 17:31
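
A rough, untested sketch of the combination described in that last comment (the URL is a placeholder; the class names come from the question):

use strict;
use warnings;

use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use URI;

my $mech = WWW::Mechanize->new;
$mech->get('https://msdn.microsoft.com/library/some-page.aspx');   # placeholder URL

# Build a real DOM from the fetched page so the link search can be
# restricted to the navigational sections.
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content );

my @anchors = $tree->findnodes(
    q{//*[contains(@class,'toc_children') or contains(@class,'toc_siblings')]//a}
);

for my $a (@anchors) {
    my $href = $a->attr('href') or next;
    my $abs  = URI->new_abs( $href, $mech->base );
    print "$abs\n";                 # or fetch it with $mech->get($abs) and save the content
}

$tree->delete;                      # HTML::Element trees want explicit cleanup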

Mojo::UserAgent returns stuff that understands CSS3 selectors or XPath. For instance, I just showed an example in Painless RSS processing with Mojo. I'm really enjoying this new(ish) web client stuff. Most everything I want is already there (no additional modules) and it's integrated very well.
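
A quick, untested sketch of what that looks like (the URL is a placeholder; the selectors reuse the class names from the question):

use strict;
use warnings;

use Mojo::UserAgent;

my $ua  = Mojo::UserAgent->new;
my $url = 'https://msdn.microsoft.com/library/some-page.aspx';   # placeholder

# The response body comes back as a Mojo::DOM object that takes CSS selectors.
my $dom = $ua->get($url)->res->dom;

# Only the links inside the toc_children/toc_siblings sections.
$dom->find('.toc_children a, .toc_siblings a')->each(sub {
    my $e = shift;
    print $e->attr('href'), "\n";
});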

brian d foy
  • Thanks - hadn't thought of using `Mojo::UserAgent` in a standalone fashion. But yes, why not? – Lumi Apr 10 '12 at 14:01

This might get you started in the right direction or lead you astray. Note that I first saved the page to a local file so as not to constantly download it while I was working on it.

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

$tree->parse_file('nmake-ref.html');

# Collect ( link text => href ) pairs for the links inside the
# "sectionblock" navigation divs.
my @links = map { { $_->as_text => $_->attr('href') } }
            $tree->findnodes(q{//div[@class='sectionblock']/*/a});

for my $link (@links) {
    my ($entry, $url) = %{ $link };

    # Derive a safe local file name from the link text.
    ($link->{file} = "$entry.html") =~ s/[^A-Za-z_0-9.]+/_/g;

    # List form of system bypasses the shell, so pass the URL unquoted.
    system 'wget', $url, '-O', $link->{file};
}
Sinan Ünür