Matching to find a line and then extracting certain elements from the line with regex in Perl

Question

I am writing a web scraper in Perl. I am having trouble extracting what I want from the data returned by the get("url") function. I want to find a particular line with one regex and then use another regex to match within that line and store the matches in an array. If someone could give me an example, that would be super helpful.

#!/usr/bin/perl

use LWP::Simple;

$regex  = m/Prerequisite:.[A-Z]{4}[0-9]{4}/g;
$regex2 = m/[A-Z]{4}[0-9]{4}/g;

$content = $ARGV[0];
#print $content;
$urlundergrad = "http://www.handbook.unsw.edu.au/undergraduate/courses/2014/$content.html";
$urlpostgrad  = "http://www.handbook.unsw.edu.au/postgraduate/courses/2014/$content.html";

if ( @ARGV = 1 ) {
    $pageU = get("$urlundergrad") or die "unable to retrieve";
    #$pageP = get("$urlPostgrad") or die "unable to retrieve";

    foreach $line ( split( "\n", $pageU ) ) {
        if ( $line =~ $regex ) {
            push( @courses, $line );
        }
    }

    print @courses;
    print "\n";

} else {
    print "usage: prereq.pl <UNSW course>";
}

Have you been to http://perldoc.perl.org and read through the regex documentation there? — i alarmed alien, Sep 07 '14 at 07:08
Have a look at [WWW::Mechanize](https://metacpan.org/release/WWW-Mechanize). There you will also find examples. — Steffen Ullrich, Sep 07 '14 at 07:11
The data returned by `get("url")` is HTML, which you should not attempt to parse with regex. Read http://stackoverflow.com/a/1732454/1382251 in order to understand why, and keep in mind that the only person in the world that can parse HTML with regex is most likely Chuck Norris (though I admit I wasn't the first one to realize that). — barak manos, Sep 07 '14 at 07:13
thanks for the help. Where is Chuck Norris when you need him. — walle_whale, Sep 07 '14 at 07:41
@user18018: Last time I heard he was making use of Chuck Norris jokes as part of the National Rifle Association campaign for defending the Second Amendment and promoting gun laws in the United States... — barak manos, Sep 07 '14 at 07:50
You should also turn on warnings and strict behaviour by adding `use strict; use warnings;` under the first line of your program. It is good programming practice and will prevent lots of problems down the road! — i alarmed alien, Sep 07 '14 at 10:38
Funny that everybody here is talking about "parsing" when the OP wants simple "matching"... While I agree that the easiest way would be using tools like `Mojo::DOM` or `Web::Scraper`, matching some text **isn't the same as parsing HTML**. — clt60, Sep 07 '14 at 15:04
So you're already pushing the prerequisite lines into the @courses array. What do you want to do with them after you're done matching them? — Len Jaffe, Sep 08 '14 at 20:28
Your definition of $regex and $regex2 should use qr// instead of m//. Will there be more than one prerequisite per line matched? — Len Jaffe, Sep 08 '14 at 20:31

score 0 · Answer 1 · answered Jun 18 '15 at 07:26

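You are not using regular expressions the right way. `m//` runs a match immediately (against `$_`) and stores the result of that match, not the pattern itself. To store a pattern, you can use the `qr` operator like this: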

$regex = qr/Prerequisite:.[A-Z]{4}[0-9]{4}/;
.
.
.
if ( $line =~ $regex ) {

Please note that you cannot use the `g` modifier with `qr`. I also don't see any reason to use it in your case. You can read more about `qr` in the perlop documentation.
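To cover the second half of the question (pulling the individual course codes out of a matched line), here is a minimal sketch of the two-step approach. It assumes that a match with `/g` in list context is what is wanted for the extraction step. The helper name `extract_prereqs` and the sample text are made up for illustration and are not from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Precompiled patterns: qr// stores the regex itself (and takes no /g).
my $line_re = qr/Prerequisite:.[A-Z]{4}[0-9]{4}/;
my $code_re = qr/[A-Z]{4}[0-9]{4}/;

# Hypothetical helper: returns every course code found on prerequisite lines.
sub extract_prereqs {
    my ($page) = @_;
    my @courses;
    for my $line ( split /\n/, $page ) {
        next unless $line =~ $line_re;          # step 1: find the line
        push @courses, $line =~ /$code_re/g;    # step 2: list-context /g returns all matches
    }
    return @courses;
}

# Made-up sample text standing in for the downloaded page.
my $sample = "<p>Prerequisite: COMP1917 or COMP1921</p>\n<p>Excluded: COMP9024</p>\n";
print join( " ", extract_prereqs($sample) ), "\n";    # prints "COMP1917 COMP1921"
```

The `/g` modifier goes on the match operator in step 2, not on `qr//`, which is why the stored patterns above carry no modifiers.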

The other way I can think of is to store the regular expression as a plain string in a variable and interpolate it into the match:

$regex = q/Prerequisite:.[A-Z]{4}[0-9]{4}/;
.
.
.
if ( $line =~ m/$regex/g ) {
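One thing worth knowing about `/g` inside an `if`: in scalar context it resumes from where the previous match on the same string ended. That is harmless when each line is tested once, as in the loop from the question, but it can surprise you if the same string is tested twice. A small sketch with a made-up input line:

```perl
use strict;
use warnings;

my $regex = q/[A-Z]{4}[0-9]{4}/;          # pattern kept as a plain string
my $line  = "Prerequisite: COMP1917";     # made-up input line

# Scalar-context /g remembers where the last match ended (see pos()).
my $first  = ( $line =~ m/$regex/g ) ? 1 : 0;    # matches COMP1917
my $second = ( $line =~ m/$regex/g ) ? 1 : 0;    # resumes after COMP1917, finds nothing

# Without /g every test starts from the beginning of the string.
my $plain_a = ( $line =~ m/$regex/ ) ? 1 : 0;
my $plain_b = ( $line =~ m/$regex/ ) ? 1 : 0;

print "with g:    $first $second\n";      # prints "with g:    1 0"
print "without g: $plain_a $plain_b\n";   # prints "without g: 1 1"
```

For a simple yes/no test, dropping `/g` avoids the issue entirely.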

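Also, please note that you have a bug in line 13 (`if ( @ARGV = 1 )`): a single `=` assigns to `@ARGV` instead of comparing, so the test is always true. You probably meant: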

if ( @ARGV == 1 ) {
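To see the difference between the two operators, here is a small sketch. The helper names `has_one_arg` and `buggy_check` are hypothetical and work on a copy of the arguments, so `@ARGV` itself is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when exactly one argument was supplied.
sub has_one_arg {
    my @args = @_;
    return @args == 1;    # an array in numeric context yields its element count
}

# Demonstrates the bug: the assignment replaces the list and is always true.
sub buggy_check {
    my @args = @_;
    my $ok = ( @args = 1 ) ? 1 : 0;    # @args is now (1), whatever it held before
    return ( $ok, @args );
}

my $with_one  = has_one_arg("COMP1917") ? "ok" : "usage";
my $with_none = has_one_arg()           ? "ok" : "usage";
print "$with_one $with_none\n";                           # prints "ok usage"
print join( ",", buggy_check( "a", "b", "c" ) ), "\n";    # prints "1,1"
```

In the original script the assignment would also overwrite the course name passed on the command line, although `$content` has already been copied from `$ARGV[0]` by that point.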
