I am writing a web scraper in Perl. I am having trouble extracting what I want from the data returned by the get("url") function. I want to find a particular line with a regex and then use another regex to match and store the matches in an array. If someone could give me an example, that would be super helpful.

```perl
#!/usr/bin/perl

use LWP::Simple;

$regex  = m/Prerequisite:.[A-Z]{4}[0-9]{4}/g;
$regex2 = m/[A-Z]{4}[0-9]{4}/g;

$content = $ARGV[0];
#print $content;
$urlundergrad = "http://www.handbook.unsw.edu.au/undergraduate/courses/2014/$content.html";
$urlpostgrad  = "http://www.handbook.unsw.edu.au/postgraduate/courses/2014/$content.html";

if ( @ARGV = 1 ) {
    $pageU = get("$urlundergrad") or die "unable to retrieve";
    #$pageP = get("$urlPostgrad") or die "unable to retrieve";

    foreach $line ( split( "\n", $pageU ) ) {
        if ( $line =~ $regex ) {
            push( @courses, $line );
        }
    }

    print @courses;
    print "\n";

} else {
    print "usage: prereq.pl <UNSW course>";
}
```
Miller
  • Have you been to http://perldoc.perl.org and read through the regex documentation there? – i alarmed alien Sep 07 '14 at 07:08
  • Have a look at [WWW::Mechanize](https://metacpan.org/release/WWW-Mechanize). There you will also find examples. – Steffen Ullrich Sep 07 '14 at 07:11
  • The data returned by `get("url")` is HTML, which you should not attempt to parse with regex. Read http://stackoverflow.com/a/1732454/1382251 in order to understand why, and keep in mind that the only person in the world that can parse HTML with regex is most likely Chuck Norris (though I admit I wasn't the first one to realize that). – barak manos Sep 07 '14 at 07:13
  • thanks for the help. Where is Chuck Norris when you need him. – walle_whale Sep 07 '14 at 07:41
  • @user18018: Last time I heard he was making use of Chuck Norris jokes as part of the National Rifle Association campaign for defending the Second Amendment and promoting gun laws in the United States... – barak manos Sep 07 '14 at 07:50
  • You should also turn on warnings and strict behaviour by adding `use strict; use warnings;` under the first line of your program. It is good programming practice and will prevent lots of problems down the road! – i alarmed alien Sep 07 '14 at 10:38
  • Funny that everybody here is talking about "parsing" when the OP wants simple "matching"... While I agree that the easiest way would be to use tools like `Mojo::DOM` or `Web::Scraper`, matching some text **isn't the same as parsing HTML**. – clt60 Sep 07 '14 at 15:04
  • I agree with @jm666. Regex is (potentially) fine here. –  Sep 08 '14 at 13:51
  • So you're already pushing the prerequisite lines into the @courses array. What do you want to do with them after you're done matching them? – Len Jaffe Sep 08 '14 at 20:28
  • Your definition of $regex and $regex2 should use qr// instead of m//. Will there be more than one prerequisite per line matched? – Len Jaffe Sep 08 '14 at 20:31

1 Answer

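You are not using regular expressions the right way. Written as $regex = m/.../g, the pattern is matched against $_ immediately, and $regex receives only the result of that match, not the pattern itself. One option is to use the 'qr' operator like this: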

```perl
$regex = qr/Prerequisite:.[A-Z]{4}[0-9]{4}/;
# ... rest of the script unchanged ...
if ( $line =~ $regex ) {
```

Please note that you cannot use the 'g' modifier with qr. I also don't see any reason to use it in your case. Read more about qr in the perlop documentation.
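
As a rough illustration of the qr approach, here is a minimal sketch that selects the matching lines. The sample text in `$page` is made up and only stands in for whatever `get()` returns; the variable names `$prereq_line` and `@lines` are likewise just for this example.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the page returned by get().
my $page = join "\n",
    '<p>Prerequisite: COMP1917 or COMP1921</p>',
    '<p>Offered in Semester 1</p>',
    '<p>Prerequisite: MATH1131</p>';

# qr// compiles the pattern and stores it; nothing is matched at this point.
my $prereq_line = qr/Prerequisite:.[A-Z]{4}[0-9]{4}/;

# Keep only the lines that the stored pattern matches.
my @lines;
foreach my $line ( split /\n/, $page ) {
    push @lines, $line if $line =~ $prereq_line;
}

print scalar(@lines), " matching lines\n";
print "$_\n" for @lines;
```

With this sample input, the first and third lines are kept and the middle one is skipped.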

The other way I can think of is to store your regular expression in a plain string variable, like this:

```perl
$regex = q/Prerequisite:.[A-Z]{4}[0-9]{4}/;
# ... rest of the script unchanged ...
if ( $line =~ m/$regex/g ) {
```
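
Where the 'g' modifier does become useful is the second step from the question, storing every course code from a matched line in an array. The following is only a sketch under assumptions: the contents of `$line` are invented, and `$code` and `@courses` are names chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical line of the kind the first pattern would select.
my $line = 'Prerequisite: COMP1917 or COMP1921, and MATH1081';

# The second pattern from the question, stored as a plain string.
my $code = q/[A-Z]{4}[0-9]{4}/;

# In list context, m//g without capture groups returns every match.
my @courses = $line =~ m/$code/g;

print join( ', ', @courses ), "\n";
```

Assigning the match to an array is what puts it in list context; in a plain `if` condition the same expression only reports whether a match was found.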

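Also, please note you have a bug in line 13: @ARGV = 1 assigns to @ARGV instead of comparing its size, so the argument list is overwritten and the condition is always true. You probably meant: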

```perl
if ( @ARGV == 1 ) {
```
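
To see the difference between the two operators, here is a small sketch that uses an ordinary array, `@args` (a name made up for this example), in place of @ARGV:

```perl
use strict;
use warnings;

# Hypothetical argument list with two entries.
my @args = ( 'COMP1911', 'extra' );

# Numeric comparison: the array is evaluated in scalar context (its size).
if ( @args == 1 ) {
    print "exactly one argument\n";
}
else {
    print "wrong number of arguments\n";
}

# List assignment: replaces the contents of @args. In scalar context it
# yields the number of elements on the right-hand side, here 1, which is true.
my $count = ( @args = 1 );
print "count=$count args=@args\n";
```

The comparison takes the `else` branch because there are two elements, while the assignment leaves `@args` holding only the value 1.
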
Alex G