
I want to extract the HTML code of a TWiki (whose URL I have). What is the best way of doing that?

Additionally, once I extract the HTML code I need to put it into a site hosted on Google Sites. Is that possible?

  • Thanks. LWP::Simple worked fine. But would anyone have a clue about my second question? I can't seem to access my site at all. – user2590739 Jul 17 '13 at 12:40

2 Answers


A very simple way to get an HTML page is the LWP::Simple module. If you have to do a more complex navigation flow, use WWW::Mechanize instead. Then, if you need to parse the HTML, @Brian's solution is good.
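For example, fetching a page with LWP::Simple takes only a few lines (the TWiki URL below is a placeholder; substitute your own):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Placeholder URL -- replace with your TWiki topic's address
my $url = 'http://twiki.example.org/bin/view/Main/WebHome';

# get() returns the page body as a string, or undef on failure
my $html = get($url);
die "Couldn't fetch $url\n" unless defined $html;

print $html;
```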

Miguel Prz

Sounds like you need the CPAN HTML::Parser module.

use strict;
use warnings;
use HTML::Parser ();

# Handlers referenced in the parser configuration below
sub start { my ($tagname, $attr) = @_; print "start: $tagname\n"; }
sub end   { my ($tagname)        = @_; print "end: $tagname\n";   }

# Create parser object
my $p = HTML::Parser->new(
    api_version     => 3,
    start_h         => [\&start, "tagname, attr"],
    end_h           => [\&end,   "tagname"],
    marked_sections => 1,
);

# Parse directly from file
$p->parse_file("foo.html");
Brian Agnew
  • I don't recommend HTML::Parser, that module needs an annoying amount of code to achieve simple things. Better and declarative: [Web::Query](http://p3rl.org/Web::Query) (CSS selectors), [HTML::TreeBuilder::XPath](http://p3rl.org/HTML::TreeBuilder::XPath) (XPath) – daxim Jul 17 '13 at 09:48
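To illustrate the comment above, a minimal Web::Query sketch using CSS selectors (here parsing an inline HTML snippet; `wq()` also accepts a URL directly):

```perl
use strict;
use warnings;
use Web::Query;

# A small inline document to select from
my $html = '<ul><li>alpha</li><li>beta</li></ul>';

# find() takes a CSS selector; each() visits every match
wq($html)->find('li')->each(sub {
    my ($i, $elem) = @_;
    print $i, ': ', $elem->text, "\n";
});
```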