Data Extraction from HTML using Perl

Question

I am in the initial stage of my college mini-project and am stuck.

Can anybody please let me know the basic and advanced concepts and ways of "Extracting Data from an HTML page Using Perl" along with the code?

If not, then please point me to resources on the topic so that I can learn on my own.

possible duplicate of [Grep and Extract Data in Perl](http://stackoverflow.com/questions/2886200/grep-and-extract-data-in-perl) — MarmiK, Jul 21 '14 at 06:58
I don't think the question is too broad. Faced with the amount of modules on CPAN, a question like "it's July 2014, where do I start?" is a perfectly legit one. The answers narrow down the list of module documentation to read to the ones that are relevant, maintained and generally accepted by the community at this point in time. — mirod, Jul 22 '14 at 18:50

score 2 · Accepted Answer · answered Jul 21 '14 at 07:17

This should get you started.

#!/usr/bin/perl

use strict;
use warnings;
use autodie;
use LWP::Simple; #For getting a website's HTML; also see LWP::UserAgent
use HTML::Tree; #Use a parser to parse HTML, read the docs on CPAN


#Use LWP to get a page's contents
#We'll use the url to this question http://stackoverflow.com/questions/24858906/data-extraction-from-html-using-perl
my $url = "http://stackoverflow.com/questions/24858906/data-extraction-from-html-using-perl";


#All the HTML will be in $content; get() returns undef on failure
my $content = get($url);
die "Could not fetch $url\n" unless defined $content;

my $p = HTML::Tree->new();

#Parse the string in $content. You can also use parse_file, or the new_from_file and new_from_url constructors
#Though for learning's sake you should get used to LWP
$p->parse($content);
$p->eof(); #Signal end of input so the tree is finalized

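#Check the HTML::Element documentation for the data manipulation part
#In scalar context find_by_attribute returns the first match, or undef if there is none
my $post = $p->find_by_attribute('class', 'post-text')
    or die "No element with class 'post-text' found\n";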

#Should print your question out.
print $post->as_text();
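
If you want to pull several fields out of a page rather than one block of text, the same tree can be walked with look_down from HTML::Element. The following is only a minimal sketch that runs without a network connection: the sample markup, the class names (name, course) and the @records structure are made up for illustration, so adjust them to whatever your real page contains.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; a real page would be fetched with LWP instead.
my $html = <<'HTML';
<html><body>
<table id="students">
  <tr><th>Name</th><th>Course</th></tr>
  <tr><td class="name">Asha</td><td class="course">Physics</td></tr>
  <tr><td class="name">Ravi</td><td class="course">Maths</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # tell the parser that the input is complete

my @records;

# In list context look_down returns every matching descendant element.
for my $row ( $tree->look_down( _tag => 'tr' ) ) {

    # In scalar context look_down returns the first match or undef.
    my $name   = $row->look_down( _tag => 'td', class => 'name' );
    my $course = $row->look_down( _tag => 'td', class => 'course' );
    next unless $name && $course;    # skips the header row

    push @records, { name => $name->as_text, course => $course->as_text };
}

$tree->delete;    # release the parse tree once the text has been copied out

print "$_->{name}: $_->{course}\n" for @records;
```

Collecting the fields into a list of hashes keeps the parsing step separate from whatever you do next, for example writing each record to a database with DBI.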

Now review the documentation for the modules used above: LWP::Simple, HTML::Tree and HTML::Element.

Thank you, Gabs, let me try it out once. What I actually wanted was to extract particular fields from the HTML page and put them in a database. — Nishi Bangar, Jul 22 '14 at 18:09
@Nishi Do you mean data placed into input fields? That you'd pass with post or get requests? — Gabs00, Jul 22 '14 at 23:00

score 0 · Answer 2 · answered Jul 21 '14 at 07:07

Depending on what you need to do exactly you may want to look at HTML::TreeBuilder (and the extension HTML::TreeBuilder::XPath to get XPath goodness) or, if you need to interact with websites, WWW::Mechanize.
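
To give an idea of what the XPath extension looks like in use, here is a rough sketch with HTML::TreeBuilder::XPath. The markup and the XPath expressions are invented for the example (a couple of div elements with class post), so treat it as a starting point rather than a recipe for any particular site.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup standing in for a downloaded page.
my $html = <<'HTML';
<html><body>
<div class="post"><h2>First post</h2><a href="/posts/1">read</a></div>
<div class="post"><h2>Second post</h2><a href="/posts/2">read</a></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # finish parsing before querying the tree

# findvalues returns the text value of every node matched by the expression.
my @titles = $tree->findvalues('//div[@class="post"]/h2');

# findnodes returns the matching elements, which are ordinary HTML::Element objects.
my @links = map { $_->attr('href') } $tree->findnodes('//div[@class="post"]/a');

$tree->delete;

print "$titles[$_] -> $links[$_]\n" for 0 .. $#titles;
```

XPath tends to pay off when the data you want is identified by its position relative to other elements, which is awkward to express with attribute lookups alone.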

Another fairly popular tool is Mojo::DOM.
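
For comparison, a small sketch of the Mojo::DOM style, which selects elements with CSS selectors. The list markup and the selector below are assumptions made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup.
my $html = <<'HTML';
<ul id="modules">
  <li class="name">HTML::TreeBuilder</li>
  <li class="name">WWW::Mechanize</li>
  <li class="note">not a module</li>
</ul>
HTML

my $dom = Mojo::DOM->new($html);

# find returns a collection of matching elements; map('text') calls the
# text method on each one and each turns the collection into a plain list.
my @modules = $dom->find('ul#modules li.name')->map('text')->each;

# at returns only the first matching element, or undef if nothing matches.
my $note = $dom->at('li.note');

print "$_\n" for @modules;
print "Note: ", $note->text, "\n" if $note;
```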

mirod
