Perl CGI produces unexpected output

Question

I have a Perl CGI script for online concordance application that searches for an instance of word in a text and prints the sorted output.

#!/usr/bin/perl -wT

# middle.pl - a simple concordance



# require
use strict;
use diagnostics;
use CGI;


# ensure all fatals go to browser during debugging and set-up
# comment this BEGIN block out on production code for security
BEGIN {
    $|=1;
    print "Content-type: text/html\n\n";
    use CGI::Carp('fatalsToBrowser');
}

# sanity check
my $q = new CGI;
my $target = $q->param("keyword");
my $radius = $q->param("span");
my $ordinal = $q->param("ord");
my $width = 2*$radius;
my $file    = 'concordanceText.txt';
if ( ! $file or ! $target ) {

    print "Usage: $0 <file> <target>\n";
    exit;
    
}

# initialize
my $count   = 0;
my @lines   = ();
$/          = ""; # Paragraph read mode

# open the file, and process each line in it
open(FILE, " < $file") or die("Can not open $file ($!).\n");
while(<FILE>){

    # re-initialize
    my $extract = '';
    
    # normalize the data
    chomp;
    s/\n/ /g;        # Replace new lines with spaces
    s/\b--\b/ -- /g; # Add spaces around dashes

    # process each item if the target is found
    while ( $_ =~ /\b$target\b/gi ){
                
        # find start position
        my $match = $1;
        my $pos   = pos;
        my $start = $pos - $radius - length($match);

        # extract the snippets
        if ($start < 0){
            $extract = substr($_, 0, $width+$start+length($match));
            $extract = (" " x -$start) . $extract;
        }else{
            $extract = substr($_, $start, $width+length($match));
            my $deficit = $width+length($match) - length($extract);
            if ($deficit > 0) {
                $extract .= (" " x $deficit);
            }
    
        }

        # add the extracted text to the list of lines, and increment
        $lines[$count] = $extract;
        ++$count;
        
    }
    
}

sub removePunctuation {
    my $string = $_[0];
    $string = lc($string); # Convert to lowercase
    $string =~ s/[^-a-z ]//g; # Remove non-aplhabetic characters 
    $string =~ s/--+/ /g; #Remove 2+ hyphens with a space 
    $string =~s/-//g; # Remove hyphens
    $string =~ s/\s=/ /g;
    return($string);
    
}

sub onLeft {
    #USAGE: $word = onLeft($string, $radius, $ordinal);
    my $left = substr($_[0], 0, $_[1]);
    $left = removePunctuation($left);
    my @word = split(/\s+/, $left);
    return($word[-$_[2]]);
}

sub byLeftWords {
    my $left_a = onLeft($a, $radius, $ordinal);
    my $left_b = onLeft($b, $radius, $ordinal);
    lc($left_a) cmp lc($left_b);
}


# process each line in the list of lines

print "Content-type: text/plain\n\n";
my $line_number = 0;
foreach my $x (sort byLeftWords @lines){
    ++$line_number;
    printf "%5d",$line_number;
    print " $x\n\n";
}

# done
exit;

The perl script produces expected result in terminal (command line). But the CGI script for online application produces unexpected output. I cannot figure out what mistake I am making in the CGI script. The CGI script should ideally produce the same output as the command line script. Any suggestion would be very helpful.

Command Line Output

CGI Output

Do not post images for text data such as input or output. Always copy/paste the data into your post and format as code. — Jim Garrison, Jan 08 '17 at 06:49

Jim Garrison · Answer 1 · 2017-01-08T07:37:42.517

The BEGIN block executes before anything else and thus before

my $q = new CGI;

The output goes to the server process' stdout and not to the HTTP stream, so the default is text/plain as you can see in the CGI output.

After you solve that problem you'll find that the output still looks like a big ugly block because you need to format and send a valid HTML page, not just a big block of text. You cannot just dump a bunch of text to the browser and expect it to do anything intelligent with it. You must create a complete HTML page with tags to layout your content, probably with CSS as well.

In other words, the output required will be completely different from the output when writing only to the terminal. How to structure it is up to you, and explaining how to do that is out of scope for StackOverflow.

Actually they can very well dump a bunch of text in the browser with `text/plain`. It will look like a bunch of text, which seems to be exactly what that program is supposed to do. — simbabque, Jan 08 '17 at 09:33

score 3 · Answer 2 · edited May 23 '17 at 12:16

As the other answers state, the BEGIN block is executed at the very start of your program.

BEGIN {
    $|=1;
    print "Content-type: text/html\n\n";
    use CGI::Carp('fatalsToBrowser');
}

There, you output an HTTP header Content-type: text/html\n\n. The browser sees that first, and treats all your output as HTML. But you only have text. Whitespace in an HTML page is collapsed into single spaces, so all your \n line breaks disappear.

Later, you print another header, the browser cannot see that as a header any more, because you already had one and finished it off with two newlines \n\n. It's now too late to switch back to text/plain.

It is perfectly fine to have a CGI program return text/plain and just have text without markup be displayed in a browser when all you want is text, and no colors or links or tables. For certain use cases this makes a lot of sense, even if it doesn't have the hyper in Hypertext any more. But you're not really doing that.

Your BEGIN block serves a purpose, but you are overdoing it. You're trying to make sure that when an error occurs, it gets nicely printed in the browser, so you don't need to deal with the server log while developing.

The CGI::Carp module and it's functionality fatalsToBrowser bring their own mechanism for that. You don't have to do it yourself.

You can safely remove the BEGIN block and just put your use CGI::CARP at the top of the script with all the other use statements. They all get run first anyway, because use gets run at compile time, while the rest of your code gets run at run time.

If you want, you can keep the $|++, which turns off the buffering for your STDOUT handle. It gets flushed immediately and every time you print something, that output goes directly to the browser instead of collecting until it's enough or there is a newline. If your process runs for a long time, this makes it easier for the user to see that stuff is happening, which is also useful in production.

The top of your program should look like this now.

#!/usr/bin/perl -T

# middle.pl - a simple concordance
use strict;
use warnigns;
use diagnostics;
use CGI;
use CGI::Carp('fatalsToBrowser');

$|=1;

my $q = CGI->new;

Finally, a a few quick words on the other parts I deleted from there.

Your comment requires over the use statements is misleading. Those are use, not require. As I said above, use gets run at compile time. require on the other hand gets run at run time and can be done conditionally. Misleading comments will make it harder for others (or you) to maintain your code later on.
I removed the -w flag from the shebang (#!/usr/bin/perl) and put the use warnings pragma in. That's a more modern way to turn on warnings, because sometimes the shebang can be ignored.
The use diagnostics pragma gives you extra long explanations when things go wrong. That's useful, but also extra slow. You can use it during development, but please remove it for production.
The comment sanity check should be moved down under the CGI instantiation.
Please use the invocation form of new to instantiate CGI, and any other classes. The -> syntax will take care of inheritance properly, while the old new CGI cannot do that.

Thanks a lot for your suggestion. The CGI now produces the expected output...! I appreciate your help. — Deep Shah, Jan 11 '17 at 16:06
@DeepShah please see [the help section](http://stackoverflow.com/help/someone-answers) to learn what to do when your question gets answered. — simbabque, Feb 27 '17 at 08:54

score 2 · Answer 3 · answered Jan 08 '17 at 08:38

2

I ran your cgi. The BEGIN block is run regardless and you print a content-type header here - you have explicitly asked for HTML here. Then later you attemp to print another header for PLAIN. This is why you can see the header text (that hasn't taken effect) at the beginning of the text in the browser window.

answered Jan 08 '17 at 08:38

Duncan Carr

250
1
5

Thanks for your suggestion. – Deep Shah Jan 11 '17 at 16:07

Perl CGI produces unexpected output

Command Line Output

CGI Output

3 Answers3