
I want to extract the HTML code of a TWiki (whose URL I have). What is the best way of doing that?

Additionally, once I extract the HTML code I need to put it into a site hosted on Google Sites. Is that possible?

  • Thanks. LWP::Simple worked fine. But would anyone have a clue about my second question? I can't seem to access my site at all. – user2590739 Jul 17 '13 at 12:40

2 Answers


A very simple way to get an HTML page is the LWP::Simple module. If you have to do a more complex navigation flow, use WWW::Mechanize instead. Then, if you need to parse the HTML, @Brian's solution is good.
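For example, fetching a page with LWP::Simple takes only a few lines (the TWiki URL below is a placeholder; substitute your own):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Placeholder URL -- replace with your TWiki topic's address
my $url = 'http://twiki.example.org/bin/view/Main/WebHome';

# get() returns the page body as a string, or undef on failure
my $html = get($url);
die "Couldn't fetch $url\n" unless defined $html;

print $html;
```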

Miguel Prz

Sounds like you need the CPAN HTML::Parser module.

use strict;
use warnings;
use HTML::Parser ();

# Handlers referenced in the parser configuration below
sub start { my ($tagname, $attr) = @_; print "start: $tagname\n"; }
sub end   { my ($tagname)        = @_; print "end: $tagname\n";   }

# Create parser object
my $p = HTML::Parser->new(
    api_version     => 3,
    start_h         => [\&start, "tagname, attr"],
    end_h           => [\&end,   "tagname"],
    marked_sections => 1,
);

# Parse directly from file
$p->parse_file("foo.html");
Brian Agnew
  • I don't recommend HTML::Parser, that module needs an annoying amount of code to achieve simple things. Better and declarative: [Web::Query](http://p3rl.org/Web::Query) (CSS selectors), [HTML::TreeBuilder::XPath](http://p3rl.org/HTML::TreeBuilder::XPath) (XPath) – daxim Jul 17 '13 at 09:48
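To illustrate the comment above, a minimal Web::Query sketch using CSS selectors (here parsing an inline HTML snippet; `wq()` also accepts a URL directly):

```perl
use strict;
use warnings;
use Web::Query;

# A small inline document to select from
my $html = '<ul><li>alpha</li><li>beta</li></ul>';

# find() takes a CSS selector; each() visits every match
wq($html)->find('li')->each(sub {
    my ($i, $elem) = @_;
    print $i, ': ', $elem->text, "\n";
});
```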