0

I was able to successfully extract everything with your suggestions. My issue came as expected, with the regex not properly recognizing something... thanks so much!! Here is my end code... hope it helps someone!

    if ($_ =~ /(Research Interests)/) {
        $research = "Research Interest";

        if ($_ =~ m/<h2>Research Interests<\/h2>(.*?)<p>(.*?)<\/p>/gs) {
            @researchInterests = split(/,+/, $2);
            $count = 1;
            foreach (@researchInterests) {
                print "$research $count:";
                print $_ . "\n";
                $count++;
            }
        }
    }
c alvarado
  • Unfortunately I am to use regular expressions for parsing... – c alvarado Dec 02 '13 at 16:53
  • Tell your teacher that you should not use regexes for parsing HTML. For your task, you should perhaps consider if you can truly match a string like `

    Research Interests

    ^M` with the regex `/

    .*

    /`.
    – TLP Dec 02 '13 at 16:57
  • Direct your teacher to this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 -- and then find out why your teacher doesn't understand that this is such a horribly incorrect habit to teach. – Vector Gorgoth Dec 02 '13 at 18:56
  • "Here is my end code... hope it helps someone!" How is it going to help anyone when you erased the question? – Alan Moore Dec 03 '13 at 20:31

3 Answers

0

The problem is that you've only read in one line at a time. Why don't you read in the entire file and match against that?

my $file;
{
    local $/;
    $file = <FILE>;
}
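
As a rough sketch of what matching against the slurped contents might look like: the sample HTML and the `extract_interests` helper below are hypothetical, and the pattern assumes the page has an `<h2>Research Interests</h2>` heading followed by a `<p>` paragraph, as the asker's regex suggests.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the whole file as one string and returns
# the comma-separated items from the first <p> after the heading.
sub extract_interests {
    my ($html) = @_;
    # /s lets . match newlines, so the heading and the paragraph
    # may sit on different lines of the original file.
    if ($html =~ m{<h2>Research Interests</h2>.*?<p>(.*?)</p>}s) {
        my $list = $1;
        $list =~ s/^\s+|\s+$//g;            # trim surrounding whitespace
        return split /\s*,\s*/, $list;      # split on commas, dropping padding
    }
    return;                                 # empty list when nothing matched
}

# Made-up sample input standing in for the slurped file.
my $file = "<p>BS, PhD</p>\r\n<h2>Research Interests</h2>\r\n"
         . "<p>Data mining, databases,\r\n information retrieval</p>\r\n";

my @interests = extract_interests($file);
my $count = 1;
print "Research Interest ", $count++, ": $_\n" for @interests;
```

Anchoring the pattern on the heading is what keeps an earlier `<p>` in the file from being picked up by mistake.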
ikegami
  • I have the whole file copied into a text file and am searching that. The goal is to extract certain fields- name, research interests, etc. So I was using the RESEARCH INTEREST line as a conditional telling me I have reached the position where I can now grab my research interests... is this incorrect logic? – c alvarado Dec 02 '13 at 17:18
  • `$_` contains `"<h2>Research Interests</h2>\r\n"`. Please reread. I didn't say anything about copying files. I said to match against the entire file rather than just the line that contains `Research Interests`. – ikegami Dec 02 '13 at 17:38
0

You can simply go get more lines at that point:

while (<FILE>) {
  if (m/Research Interests/) {
    # Heading found: keep reading until the next <p>...</p> line.
    while (<FILE>) {
      if (m/<p>(.*)<\/p>/) {
        print "Research Interests: $1\n";
        last;
      }
    }
  }
}

I don't know whether your file is huge or not, but it's worth learning techniques that don't require reading the whole file at once so that you can deal with arbitrarily large files, or with streams.
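
A minimal sketch of the same line-by-line idea written with a flag instead of a nested loop. The sample input is made up, an in-memory filehandle stands in for the real file, and it assumes each `<p>...</p>` sits on a single line somewhere after the heading line.

```perl
use strict;
use warnings;

# Made-up sample input; an in-memory filehandle stands in for the real file.
my $sample = "<p>BS, PhD</p>\n<h2>Research Interests</h2>\n"
           . "<p>Data mining, databases</p>\n<p>Other</p>\n";
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";

my $seen_heading = 0;   # becomes true once the heading line has gone by
my $found;              # text of the first paragraph after the heading
while (my $line = <$fh>) {
    if (!$seen_heading) {
        $seen_heading = 1 if $line =~ /Research Interests/;
        next;            # paragraphs before the heading are skipped
    }
    if ($line =~ m{<p>(.*?)</p>}) {
        $found = $1;
        last;            # stop early; only one line is held at a time
    }
}
close $fh;

print "Research Interests: $found\n" if defined $found;
```

A paragraph that spans several lines would need extra buffering here, which is the trade-off against reading the whole file at once.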

Ken Williams
  • 2
    Actually, you do know. If the HTML file doesn't easily fit in memory, how is the browser going to display it? It's far less error-prone to read the entire file at once. – ikegami Dec 02 '13 at 18:36
  • Maybe I should have said "I don't care" =). Since this looks like homework, I do think it's worth learning stream-compliant processing techniques. – Ken Williams Dec 02 '13 at 22:51
-1

If you absolutely have to do this, you could try setting the input record separator (`$/`) to undef, so that the whole file is read in one go:

#!/usr/bin/perl
use warnings;
use strict;

my $infile = 'in.txt';
open my $input, '<', $infile or die "Can't open $infile: $!";

my $research_interests;
$/ = undef;   # no record separator: the first read returns the whole file
while (<$input>) {
    if ($_ =~ /(Research Interests)/) {
        $research_interests = $1;
        if ($_ =~ m/<p>(.*)<\/p>/) {
            print "Title: $research_interests\nInterests: $1\n";
        }
    }
}

Prints:

Title: Research Interests
Interests: Data mining, databases, information retrieval
ikegami
fugu
  • Fixed your misplaced change to `$/`. Why use `while`? I posted a much better way 15 minutes earlier. – ikegami Dec 02 '13 at 17:37
  • Thanks flying frog, I am getting closer. I did a little reading on what setting the newline separator to undef really does because I was not getting the same results and I am a little confused. Essentially, the file is copied over as a whole, and any newline characters are ignored. This initially made sense to me, but my output is actually printing two lines **before** the Research Interest match was made. This is my output... `Title: Research Interests Interests: BS, PhD, Simon Fraser University` – c alvarado Dec 02 '13 at 18:46
  • You'll probably need to change the regexes then. Paste up some more of your input so that we can troubleshoot what's going wrong... – fugu Dec 02 '13 at 19:19
  • ok so I have been playing with this in an attempt to understand it.. and I have gone this far...I seem to be having an issue still grabbing each of the research interests: – c alvarado Dec 02 '13 at 20:17
  • I mean, I could write a regex that would be able to pull out the information you're after, but as everyone else has said, you really shouldn't even bother to parse HTML using regexes. – fugu Dec 02 '13 at 23:03