How does one -- in Perl -- stream a list of URLs from a file into an array to then recursively acquire all of their HTML data in a single file?

Question

Another laborious title... Sorry... Anyway, I've got a file called mash.txt with a bunch of URLs like this in it:

http://www...

.

So, at this point, I'd like to feed these (URLs) into an array--possibly without having to declare anything along the way--to then recursively suck up the HTML data from each one and append it all to the same file--which I guess will have to be created... Anyhow, thanks in advance.

Actually, to be completely forthcoming, what I really want from each HTML document is just the value attribute of each option tag, so I don't have all that garbage... That is, each of these

http://www...

will produce something like this

<!DOCTYPE html>
<HTML>
   <HEAD>
      <TITLE>
         DATA! 
      </TITLE>
   </HEAD>
<BODY>
.
.
.

All I want out of all of these is the value attribute of each option tag that occurs in the HTML behind each URL in mash.txt.

Could you give an example of this mysterious `value` tag? Are you implying that there is only ONE of these tags per HTML source? Also, in the spirit of SO, what have you written so far and what are you stuck on? Please show that you've done some work on this. — jimtut, Mar 04 '14 at 00:26

Chris · Accepted Answer · 2014-03-04T02:20:44.460


The following fetches the HTML content for each URL in mash.txt, retrieves the value attribute of every option element, and stores each one as a key in a single hash, which removes duplicates along the way. The hash keys are then passed as an array to input.template, and the processed output is written to output.html:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use Template;

my %values;
my $input_file     = 'mash.txt';
my $input_template = 'input.template';
my $output_file    = 'output.html';

# create a new lwp user agent object (our browser).
my $ua = LWP::UserAgent->new( );

# open the input file (mash.txt) for reading.
open my $fh, '<', $input_file or die "cannot open '$input_file': $!";

# iterate through each line (url) in the input file.
while ( my $url = <$fh> )
{
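    # strip the trailing newline so it does not end up in the request or in the warning below, and skip blank lines.
    chomp $url;
    next unless length $url;

    # get the html contents from url. It returns a handy response object.
    my $response = $ua->get( $url );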

    # if we successfully got the html contents from url.
    if ( $response->is_success ) 
    {
        # create a new html tree builder object (our html parser) from the html content.
        my $tb = HTML::TreeBuilder->new_from_content( $response->decoded_content );

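        # fetch values across options and store each one as a key in the values hash (hash keys are unique, so duplicates collapse).
        # look_down returns a list of option node objects, which we translate to the value of the value attribute via attr upon map.
        # an option without a value attribute yields undef, so grep drops those before they are used as hash keys.
        $values{$_} = undef for grep { defined } map { $_->attr( 'value' ) } $tb->look_down( _tag => 'option' );

        # release the parse tree now that we are done with this page.
        $tb->delete;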
    }
    # else we failed to get the html contents from url.
    else 
    {
        # warn of failure before next iteration (next url).
        warn "could not get '$url': " . $response->status_line;
    }
}

# close the input file since we have finished with it.
close $fh;

# create a new template object (our output processor).
my $tp = Template->new( ) || die Template->error( );

# process the input template (input.template), passing in the unique values (the hash keys) as an array reference, and write the result to the output file (output.html).
$tp->process( $input_template, { values => [ keys %values ] }, $output_file ) || die $tp->error( );

__END__

input.template could look something like:

<ul>
[% FOREACH value IN values %]
    <li>[% value %]</li>
[% END %]
</ul>
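
One thing to be aware of with the hash approach in the script above: keys %values comes back in no particular order. If the order in which the values were first seen matters, something like the following might do instead. It's only a sketch using core Perl; unique_in_order is a made-up helper name and the sample list stands in for real scraped values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# unique_in_order is a hypothetical helper, not part of the script above.
# It drops undefined entries and repeats, keeping the first occurrence of each value.
sub unique_in_order {
    my %seen;
    return grep { defined($_) && !$seen{$_}++ } @_;
}

# Sample data standing in for option values scraped from several pages.
my @scraped = ( 'BIOENGR', undef, 'CHEM', 'BIOENGR', 'MATH', 'CHEM' );
my @unique  = unique_in_order(@scraped);

# prints BIOENGR, CHEM and MATH, one per line, in that order.
print "$_\n" for @unique;
```

The %seen hash does the same job as %values in the script, but because grep walks the list in order, the surviving values keep their original sequence.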

edited Mar 04 '14 at 02:20

answered Mar 04 '14 at 00:53

Chris


For a beginner, I'd sure love it if your code had comments. I don't mind having my hand held--figuratively (in this particular instance). – user3333975 Mar 04 '14 at 01:53
Hmmm... This is good... How can I filter repeats using your method? – user3333975 Mar 04 '14 at 02:02
Could you explain in more detail what you mean by filter repeats, is it that you want to remove duplicate values? – Chris Mar 04 '14 at 02:14
Yeah, just in case that happens. I don't think it will, but you never know. – user3333975 Mar 04 '14 at 02:16
I have updated my post after adjusting it to use a hash instead of an array, effectively removing duplicates on the fly. – Chris Mar 04 '14 at 02:22
How about this, I'll show you what I'm working on. I'm new to Perl, so don't laugh if you know of a quicker way. I kind of just want to do it this way. Here is the link: http://pastebin.com/6wQ2FaDw – user3333975 Mar 04 '14 at 02:22
I don't want to use Template. What is `` exactly? – user3333975 Mar 04 '14 at 02:23
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/48910/discussion-between-chris-and-user3333975) – Chris Mar 04 '14 at 02:25
Anyway, my end goal--hopefully it's clear from the code (feel free to try it)--is to get all of [these](http://www.registrar.ucla.edu/schedule/detselect.aspx?termsel=13F&subareasel=BIOENGR&idxcrs=0165EW++) by sucking up all the `value` entries in all pages of [this](http://www.registrar.ucla.edu/schedule/crsredir.aspx?termsel=13F&subareasel=BIOENGR) form. I got all the subject area `value` entries from [here](http://www.registrar.ucla.edu/schedule/schedulehome.aspx) already. – user3333975 Mar 04 '14 at 02:29
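
On the point raised in the comments about not wanting Template: the last step could probably be replaced by a plain print to a file. A minimal sketch, assuming one value per line is good enough; write_values and values.txt are made-up names, and the sample values stand in for keys %values from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# write_values is a hypothetical helper: it writes one value per line to $path
# (overwriting any existing file) and returns how many values were written.
sub write_values {
    my ( $path, @values ) = @_;
    open my $out, '>', $path or die "cannot open '$path': $!";
    print {$out} "$_\n" for @values;
    close $out or die "cannot close '$path': $!";
    return scalar @values;
}

# Sample values; in the script these would come from keys %values.
my $count = write_values( 'values.txt', 'BIOENGR', 'CHEM', 'MATH' );
print "wrote $count values\n";
```

Opening with '>>' instead of '>' would append to an existing file rather than overwrite it, which is closer to what the question originally describes.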
