
I can't get the Perl script below to write to the file output.html.

It doesn't need to be a CGI script yet, but that is the ultimate intention.

Can anyone tell me why it isn't writing any text to output.html?

#!/usr/bin/perl

#-----------------------------------------------------------------------
# This script should work as a CGI script, if I understand it correctly.
# Most CGI scripts for Perl begin with the same line and must be
# stored in your server's cgi-bin directory. (I think this is set by
# your web server.)
#
# This script will scrape news sites for stories about topics input
# by the users.
#
# Lara Landis
# Sinister Porpoise Computing
# 1/4/2018
# Personal Perl Project
#-----------------------------------------------------------------------

@global_sites = ();

print( "Starting program.\n" );

if ( !( -e "sitedata.txt" ) ) {
    enter_site_info( @global_sites );
}

if ( !( -e "scrpdata.txt" ) ) {

    print( "scrpdata.txt does not exist. Creating file now.\n" );
    print( "Enter the search words you wish to search for below. Press Ctrl-D to finish.\n" );

    open( SCRAPEFILE, ">scrpdata.txt" );
    while ( $line = <STDIN> ) {
        chop( $line );
        print SCRAPEFILE ( "$line\n" );
    }
    close( SCRAPEFILE );
}

print( "Finished getting site data..." );
scrape_sites( @global_sites );

#----------------------------------------------------------------------
# This routine gets information from the user after the file has been
# created. It also has some basic checking to make sure that the lines
# fed to it are legitimate domains. This is not an exhaustive list of
# all domains in existence.
#----------------------------------------------------------------------
sub enter_site_info {
    my ( @sisites ) = @_;

    $x = 1;

    open( DATAFILE, ">sitedata.txt" ) || die( "Could not open datafile.\n" );
    print( "Enter websites below. Press Crtl-D to finish.\n" );

    while ( $x <= @sisites ) {

        $sisites[$x] = <STDIN>;

        print( "$sisites[$x] added.\n" );
        print DATAFILE ( "$sisites[$x]\n" );

        $x++;
    }

    close( DATAFILE );

    return @sisites;
}

#----------------------------------------------------------------------
# If the file exists, just get the information from it.  Read info in
# from the sites. Remember to create a global array for the sites
# data.
#-----------------------------------------------------------------------

#-----------------------------------------------------------------------
# Get the text to find in the sites that are being scraped. This requires
# nested loops. It starts by going through the loops for the text to be
# scraped, and then it goes through each of the websites listed in the
# sitedata.txt file.
#-----------------------------------------------------------------------
sub scrape_sites {
    my ( @ss_info ) = @_;

    @gsi_info = ();
    @toscrape = ();
    $y        = 1;

    #---------------------------
    # Working code to be altered
    #---------------------------
    print( "Getting site info..." );

    $x = 1;

    open( DATAFILE, "sitedata.txt" ) || die( "Can't open sitedata.txt.txt\n" );
    while ( $gsi_info[$x] = <DATAFILE> ) {

        chop( $gsi_info[$x] );
        print( "$gsi_info[$x]\n" );

        $x++;
    }

    close( DATAFILE );

    open( SCRAPEFILE, "scrpdata.txt" ) || die( "Can't open scrpdata.txt\n" );
    print( "Getting scrape data.\n" );

    $y = 1;

    while ( $toscrape[$y] = <SCRAPEFILE> ) {
        chop( $toscrape[$y] );
        $y++;
    }

    close( SCRAPEFILE );

    print( "Now opening the output file.\n" );

    $z = 1;

    open( OUTPUT, ">output.html" );
    print( "Now scraping sites.\n" );

    while ( $z <= @gsi_info ) {    #This loop contains SITES

        system( "rm -f index.html.*" );
        system( "wget $gsi_info[$z]" );

        $z1 = 1;

        print( "Searching site $gsi_info[$z] for $toscrape[$z1]\n" );
        open( TEMPFILE, "$gsi_info[$z]" );

        $comptext = <TEMPFILE>;

        while ( $comptext =~ /$toscrape[z1]/ig ) {    # This loop fetches data from the search terms

            print( "Now scraping $gsi_info[$z] for $toscrape[$z1]\n" );
            print OUTPUT ( "$toscrape[$z1]\n" );

            $z1++;
        }

        close( TEMPFILE );

        $z++;
    }

    close( OUTPUT );

    return ( @gsi_info );
}
  • Where are you actually writing to `output.html`? It isn't anywhere in your pl. – Mike Tung Jan 15 '18 at 01:12
  • More important, a list is passed by value so `enter_site_info` does not change the value of its argument. You are not scraping anything. – Jim Garrison Jan 15 '18 at 01:21
  • If it weren't for the subroutine calls without `&` and the `my` declarations, I'd assume this is perl4 code from 1992 or so. Start with `use strict; use warnings;` and fix all the problems. – melpomene Jan 15 '18 at 01:36
  • You should avoid `chop`. `chomp` is much safer. And the first index of a Perl array is zero. You have written *way* too much before stopping to test your code. You should write just a very few lines at a time before running the program to make sure that it works so far. You certainly shouldn't get anywhere near finishing the program before testing unless it is a trivial piece of code just a line or two long. And you should write your code so that the comments are unnecessary. Only particularly tricky techniques need an advisory comment. (A short sketch of the `chomp` and zero-index points follows these comments.) – Borodin Jan 15 '18 at 05:26
  • @Mike: Try scrolling the code – Borodin Jan 15 '18 at 05:30
  • why would you use CGI in 2018? – Gerhard Jan 15 '18 at 06:41
  • Because this script has gone through changes. The initial intent was to return the variable values to the main program. And it looks like I got Python and Perl mixed up with chop and chomp. – T. Sinister Porpoise Jan 15 '18 at 19:40
  • Also, I'm not a fan of commenting in general, but people expect it. – T. Sinister Porpoise Jan 15 '18 at 19:40
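
A minimal sketch of the `chomp`-vs-`chop` and zero-based-array points from the comments above (the file name is taken from the question; the `push`-based loop is just one idiomatic alternative to a manual counter):

# chomp() removes only a trailing newline, if one is present;
# chop() blindly removes the last character, whatever it is.
my $line = "example.com\n";
chomp $line;                    # $line is now "example.com"

# Perl arrays start at index 0, so a read loop is usually written
# without a manual counter at all:
my @sites;
open( my $fh, '<', 'sitedata.txt' ) or die "Can't open sitedata.txt: $!";
while ( my $site = <$fh> ) {
    chomp $site;
    push @sites, $site;         # push appends to the end of the array
}
close( $fh );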

3 Answers


You're making assumptions about the current working directory that are often incorrect. You seem to assume the current working directory is the directory in which the script resides, but that's never guaranteed, and it's often / for CGI scripts.

"sitedata.txt"

should be

use FindBin qw( $RealBin );

"$RealBin/sitedata.txt"

There could also be a permission error. You should include the error cause ($!) in your error message when open fails so you know what is causing the problem!
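
A minimal sketch of how that looks in context (the lexical filehandle and three-argument open are my additions, not something FindBin requires):

use strict;
use warnings;
use FindBin qw( $RealBin );

# $RealBin is the directory the script itself lives in, so this works
# no matter what the current working directory happens to be.
my $sitefile = "$RealBin/sitedata.txt";

# Including $! in the message tells you *why* the open failed
# (missing file, bad permissions, ...).
open( my $datafile, '<', $sitefile )
    or die "Could not open $sitefile: $!\n";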

ikegami

You're checking some of your open calls, but not all of them, and none of your system calls. If one fails, the program will keep going with no error message telling you why.

You can add checks to all of these, but it's easy to forget. Instead, use autodie to do the checks for you.

You'll also want to use strict to ensure you haven't made any variable typos, and use warnings to warn you about small mistakes. See this answer for more.

Also @global_sites is empty so enter_site_info() isn't going to do anything. And scrape_sites() does nothing with its argument, @ss_info.
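
A short sketch of the suggested preamble (the `:all` tag is what makes autodie cover system calls as well as open and close; the file name is taken from the question):

use strict;            # catches misspelled variable names
use warnings;          # warns about likely mistakes
use autodie qw(:all);  # open, close, system, etc. now die with a useful message on failure

# No "|| die ..." needed any more: a failed open throws an exception
# that already includes the file name and $!.
open( my $scrapefile, '>', 'scrpdata.txt' );
print {$scrapefile} "some search term\n";
close( $scrapefile );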

Schwern

All of these things are helpful. Thank you. I found the problem: I was opening the wrong file. Adding the error checking to that open is what let me spot the error. It should have been

open (TEMPFILE, "index.html") || die ("Cannot open index.html\n");

I have taken as many of the suggestions as I remembered and included them in the code. I still need to implement the directory advice, but it should not be difficult.
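
For reference, the same open with the earlier suggestions folded in (three-argument open, a lexical filehandle, and $! in the error message); the loop body is only a placeholder:

open( my $tempfile, '<', 'index.html' )
    or die "Cannot open index.html: $!\n";

while ( my $comptext = <$tempfile> ) {
    # ... match $comptext against the scrape terms here ...
}

close( $tempfile );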