How do I get data from a web page and save it with Perl?

Question

I want to write a program that:

Connects to a blogging service and lists the names of recently updated blogs.
Saves the blog names in a text file.

How can I connect to a web page, read data from it, and save that data?

score 4 · Answer 1 · answered Dec 06 '13 at 20:38

Perl has several web toolkits for slightly different tasks. Consider LWP::UserAgent with HTML::Tree, Web::Query, or Mojo (Mojolicious). I prefer Mojo.

Once we have the page, we can use CSS selectors to extract the data we are interested in. Here, I look at new perl questions:

use strict;    # safety net
use warnings;  # safety net
use feature 'say'; # a better "print"

use Mojo::UserAgent;  # load the user agent class explicitly

# fetch the stackoverflow perl page

my $ua = Mojo::UserAgent->new;
my $perl_page = $ua->get('http://stackoverflow.com/questions/tagged/perl')->res->dom;

# extract all questions:

my $questions = $perl_page->at('#questions');
for my $question ($questions->find('h3 > a')->each) {
  say $question->all_text;
  say "  <", $question->attr('href'), ">";
}

Output:

Perl script, parse text file between words
  </questions/20432447/perl-script-parse-text-file-between-words>
Having issues with Spreadsheet::WriteExcel that makes me run the script twice to get desired file
  </questions/20432157/having-issues-with-spreadsheetwriteexcel-that-makes-me-run-the-script-twice-to>
Calculate distance between a single atom and other atoms in a pdb file; print issue
  </questions/20431884/calculate-distance-between-a-single-atom-and-other-atoms-in-a-pdb-file-print-is>
Exit status of child spawned in a pipe
  </questions/20431810/exit-status-of-child-spawned-in-a-pipe>
How get data from a web page and save it with perl?
  </questions/20431443/how-get-data-from-a-web-page-and-save-it-with-perl>
GatoIcon.py automatically generated <?> from images via perl?
  </questions/20431389/gatoicon-py-automatically-generated-from-images-via-perl>
How and when can I use PPMs that weren't built in in ActivePerl 5.18?
  </questions/20430599/how-and-when-can-i-use-ppms-that-werent-built-in-in-activeperl-5-18>
Translating perl to python - What does this line do (class variable confusion)
  </questions/20429516/translating-perl-to-python-what-does-this-line-do-class-variable-confusion>
Fix files “corrupted” by Perl
  </questions/20427916/fix-files-corrupted-by-perl>
how to add slash separator in perl
  </questions/20427499/how-to-add-slash-separator-in-perl>
negative look ahead on whole number but preceded by a character(perl)
  </questions/20426507/negative-look-ahead-on-whole-number-but-preceded-by-a-characterperl>
Use variable expansion in heredoc while piping data to gnuplot
  </questions/20426379/use-variable-expansion-in-heredoc-while-piping-data-to-gnuplot>
How do I create multiple database connections in Catalyst with DBIC
  </questions/20425107/how-do-i-create-multiple-database-connections-in-catalyst-with-dbic>
Moose's attribute vs simple sub?
  </questions/20424929/mooses-attribute-vs-simple-sub>
How to use unicode in perl CGI param
  </questions/20424488/how-to-use-unicode-in-perl-cgi-param>
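The snippet above only prints the titles. The second half of the question, saving them to a text file, could be sketched as follows. This is a minimal sketch and not part of the original answer: the inline HTML, the helper names `extract_titles` and `save_lines`, and the file name `titles.txt` are assumptions for illustration. A real run would pass in the page fetched with `Mojo::UserAgent` instead of the inline sample.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Return the text of every link inside an <h3> under #questions.
sub extract_titles {
    my ($html) = @_;
    my $container = Mojo::DOM->new($html)->at('#questions')
        or return;    # selector matched nothing
    return $container->find('h3 > a')->map('all_text')->each;
}

# Write one entry per line, replacing any previous file content.
sub save_lines {
    my ($path, @lines) = @_;
    open(my $fh, '>:encoding(UTF-8)', $path)
        or die "Cannot open $path for writing: $!";
    print {$fh} "$_\n" for @lines;
    close($fh) or die "Cannot close $path: $!";
    return scalar @lines;
}

# Hypothetical stand-in for a downloaded page.
my $html = <<'HTML';
<div id="questions">
  <h3><a href="/questions/1/first">First question</a></h3>
  <h3><a href="/questions/2/second">Second question</a></h3>
</div>
HTML

my $count = save_lines('titles.txt', extract_titles($html));
print "Saved $count titles to titles.txt\n";
```

Keeping extraction and saving in separate subs means the selector can be adapted to the blogging service's markup without touching the file handling.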

score 2 · Answer 2 · edited Oct 02 '19 at 20:12

You need to load a library to connect to another server and open a file to write/print to it:

use LWP::Simple qw /get/;
my $content = get $url;

open (MYFILE, '>>data.txt');
print MYFILE $content;
close (MYFILE);
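A more defensive variant of the snippet above might look like the following. It is a sketch under assumptions: the sub name `fetch_and_save`, the script name in the usage line, and taking the URL and file name from the command line are all illustrative choices, not part of the original answer. It uses `strict` and `warnings`, checks whether the fetch succeeded, and opens the file with the three-argument form of `open` in overwrite mode.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch $url and write the body to $path, replacing any previous content.
sub fetch_and_save {
    my ($url, $path) = @_;

    # LWP::Simple::get returns undef when the request fails.
    my $content = get($url);
    die "Could not fetch $url\n" unless defined $content;

    # Three-argument open with a lexical filehandle; '>' overwrites,
    # whereas '>>' would append on every run.
    open(my $fh, '>:encoding(UTF-8)', $path)
        or die "Cannot open $path for writing: $!";
    print {$fh} $content;
    close($fh) or die "Cannot close $path: $!";
    return length $content;
}

# Usage (hypothetical script name): perl save_page.pl URL FILE
if (@ARGV == 2) {
    my $chars = fetch_and_save(@ARGV);
    print "Wrote $chars characters to $ARGV[1]\n";
}
```

This still saves the whole page; extracting only selected parts calls for an HTML parser on `$content` before writing.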

A copy of the Perl manual in Windows Help (CHM) format is available at https://code.google.com/p/htmlhelp/downloads/detail?name=perl-5.10.0.chm.

answered Dec 06 '13 at 19:02 by Wayne · edited Oct 02 '19 at 20:12 by Ωmega

Thanks for your answer, but I don't want to save the whole web page. I just want to save the blog names. What should I do? – Javad MKoushyar Dec 06 '13 at 19:05
Parse the name out of $content, typically with a regular expression, e.g. $content =~ m/([a-zA-Z\/][^>]+)<\/title>/si; or extract whatever other information you need to process. – Wayne Dec 06 '13 at 19:12
1) Always `use strict; use warnings;`, 2) Use the [three-argument form of `open`](http://modernperlbooks.com/mt/2010/04/three-arg-open-migrating-to-modern-perl.html), 3) `>>` means append; use `>`, in this case, 4) See the canonical [You can't parse HTML with regex. Because HTML can't be parsed by regex.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Kenosis Dec 06 '13 at 19:42

score 2 · Answer 3 · answered Dec 06 '13 at 20:38

You can use WWW::Mechanize to access web page content, and even to log in and navigate through several web pages:

use WWW::Mechanize;
my $mech = WWW::Mechanize->new();

# $url holds the address of the page to fetch
$mech->get( $url );

# follow links by position, by link text, or by URL
$mech->follow_link( n => 3 );
$mech->follow_link( text_regex => qr/download this/i );
$mech->follow_link( url => 'http://host.com/index.html' );

# fill in and submit the third form on the page (e.g. a login form)
$mech->submit_form(
    form_number => 3,
    fields      => {
        username    => 'mungo',
        password    => 'lost-and-alone',
    }
);

$mech->submit_form(
    form_name => 'search',
    fields    => { query  => 'pot of gold', },
    button    => 'Search Now'
);

# get all textarea controls whose names begin with "customer"
my @customer_text_inputs = $mech->find_all_inputs(
    type       => 'textarea',
    name_regex => qr/^customer/,
);

# get all text or textarea controls called "customer"
# (reuses the array instead of declaring it a second time)
@customer_text_inputs = $mech->find_all_inputs(
    type_regex => qr/^(text|textarea)$/,
    name       => 'customer',
);
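Applied to the original question, the same module can collect the text of the links on a page and write them to a file. The following is a sketch under assumptions: the sub name `link_texts`, the script name in the usage line, and the idea that the blog names appear as link text are all illustrative. The actual blogging service may need a narrower filter, for instance via `find_all_links`.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Fetch $url and return the visible text of every link on the page.
sub link_texts {
    my ($url) = @_;

    # autocheck makes get() die on HTTP errors instead of failing silently.
    my $mech = WWW::Mechanize->new(autocheck => 1);
    $mech->get($url);

    # links() returns WWW::Mechanize::Link objects; text() can be undef
    # (for example for image links), so skip those.
    return grep { defined($_) && length($_) } map { $_->text } $mech->links;
}

# Usage (hypothetical script name): perl save_links.pl URL FILE
if (@ARGV == 2) {
    my ($url, $path) = @ARGV;
    open(my $fh, '>:encoding(UTF-8)', $path)
        or die "Cannot open $path for writing: $!";
    print {$fh} "$_\n" for link_texts($url);
    close($fh) or die "Cannot close $path: $!";
}
```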

What about cookies? Are they taken care of automatically when logging in (using $mech->submit_form())? – Peter Mortensen Apr 16 '15 at 19:04
@PeterMortensen Yes, http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm#new() – alex Apr 17 '15 at 21:00
