3

So I have this:

for $i (0..@parsedText) {
if ($parsedText[$i] =~ /\s{20}<a href/) {

    my $eventID = $parsedText[$i];
    my $eventLink = $parsedText[$i];
    my $event_id_title = $parsedText[$i];

    $eventID =~ s/[\s\S]*?id=(\d+).*\n/$1/;
    $eventLink =~ s/[\s\S]*?'(.*?)'.*/$1/;
    $event_id_title =~ s/\s+<a[\s\S]*?>([^<]*).*\n/$1/;
    };
};

But for some reason, if I print any of them, I get the original value back instead of the result of the substitution that I want.

Thanks for your help

Aelfhere
  • 2
    You should finish off your example to show us exactly how you're printing it. It sounds trivial, but in this case, could be important. – Greg Jackson Jun 23 '11 at 21:19
  • 3
    :O positive votes on parsing html with regex; anyways, your syntax looks to be correct, are you sure your regexes are right? What are your inputs? – NorthGuard Jun 23 '11 at 21:35
  • 1
    Your `for` loop should be written `for my $i (0 .. $#parsedText)`. Your example will read one element past the end. And agreeing with Greg, you need to post exactly how you are printing the data. – Eric Strom Jun 23 '11 at 21:38
  • See also: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 ;-) – johnsyweb Jun 23 '11 at 22:39

2 Answers

5

You're getting out exactly what you put in because the pattern side of your substitution isn't matching, so no substitution is being done.

My guess is (since no input has been shown) that the elements of your parsedText array don't contain newlines, while your first and third patterns end in \n and so can never match. Here's a slightly cleaner way of writing what you've done above:

foreach ( @parsedText ) {
  if (/\s{20}<a href/) {

    ( my $eventID = $_ )        =~ s/.*?id=(\d+).*/$1/;
    ( my $eventLink = $_ )      =~ s/.*?'(.*?)'.*/$1/;
    ( my $event_id_title = $_ ) =~ s/\s+<a.*?>(.*?)<.*/$1/;

    print "$eventID, $eventLink, $event_id_title\n";
  }
}
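To see the no-match behaviour in isolation, here is a minimal sketch. The sample line is hypothetical (the real input was never posted); it only assumes the element has 20 leading spaces, a single-quoted href containing id=, and no trailing newline.

```perl
use strict;
use warnings;

# Hypothetical sample element: no trailing newline, as after split /\n/.
my $line = ( ' ' x 20 ) . "<a href='event.php?id=42'>Spring Concert</a>";

# Pattern from the question: it insists on a trailing \n, so it cannot match.
( my $with_newline = $line ) =~ s/[\s\S]*?id=(\d+).*\n/$1/;

# Same pattern without the trailing \n.
( my $without_newline = $line ) =~ s/[\s\S]*?id=(\d+).*/$1/;

# s/// returns the number of substitutions made, so a false value means
# the pattern did not match and the string was left untouched.
my $copy    = $line;
my $changed = ( $copy =~ s/[\s\S]*?id=(\d+).*\n/$1/ );

print "with \\n:    '$with_newline'\n";
print "without \\n: '$without_newline'\n";
print "matched?    ", ( $changed ? "yes" : "no" ), "\n";
```

Checking the return value of s/// (or testing with m// first) is a cheap way to notice a silent non-match like this one.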

Generally, you should avoid parsing HTML like this. Lean instead on the years of collected wisdom at http://cpan.org and use HTML::Parser, HTML::Parser::Simple, or HTML::TreeBuilder.
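As a rough idea of what that looks like, here is a sketch using HTML::TreeBuilder. It assumes the module is installed from CPAN (it is not part of core Perl), and the markup and the id= query parameter are made up for illustration, since the real page was not shown.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, assumed to be installed

# Hypothetical markup standing in for the real page.
my $html = <<'HTML';
<ul>
  <li><a href='event.php?id=42'>Spring Concert</a></li>
  <li><a href="event.php?id=43">Summer Fair</a></li>
</ul>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @events;
for my $anchor ( $tree->look_down( _tag => 'a' ) ) {
    my $link = $anchor->attr('href');
    next unless defined $link;

    # The parser has already dealt with quoting; only the id is pulled
    # out of the URL with a small regex.
    my ($id) = $link =~ /id=(\d+)/;
    push @events, { id => $id, link => $link, title => $anchor->as_text };
}

$tree->delete;    # release the parse tree

print "$_->{id}, $_->{link}, $_->{title}\n" for @events;
```

The parser copes with either quote style and with whitespace or line breaks inside the tag, which is exactly where the hand-written patterns tend to break.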

unpythonic
  • Thank you very much, this was the issue. I had split the string at newlines to make an array, but then immediately forgot this when writing the regex. Also, I will look into these parsers... although with my very basic understanding it might be a bit confusing to me. So expect more questions forthcoming :P – Aelfhere Jun 24 '11 at 14:27
0

This works...

my $eventID = $parsedText[$i];
my $eventLink = $parsedText[$i];
my $event_id_title = $parsedText[$i];

$eventID =~ s/.*id=['"]?(\d+)['"]?.*/$1/;
$eventLink =~ s/^.+a\s+href\s*=\s*(['"])(.*?)\1.*/$2/;
$event_id_title =~ s/\s+<a.*?>([^<]*).*/$1/;

print "$eventID\n";
print "$eventLink\n";
print "$event_id_title\n";

Regular expressions can be tricky. It's best to build a test program and test them bit by bit until you get what you want. Remember that you can use single or double quotes in HTML, and that URLs can have quotes in them. Also, IDs don't have to be numeric (although I kept it as such here).
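One possible shape for such a test program, using the core Test::More module. The extract_id helper and the input lines are hypothetical; they only wrap the ID substitution shown above so it can be checked against a few quoting styles.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper wrapping the ID substitution from this answer.
sub extract_id {
    my ($line) = @_;
    ( my $id = $line ) =~ s/.*id=['"]?(\d+)['"]?.*/$1/;
    return $id;
}

# Hypothetical inputs mapped to the ID each one should yield.
my %cases = (
    q{<a href='event.php?id=42'>Spring Concert</a>}  => '42',
    q{<a href="event.php?id=43">Summer Fair</a>}     => '43',
    q{<a href="event.php" id="44">Autumn Market</a>} => '44',
);

is( extract_id($_), $cases{$_}, "id from: $_" ) for sort keys %cases;

done_testing();
```

Adding one case at a time makes it obvious which input breaks a pattern.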

The '\1' in the $eventLink pattern refers back to whichever quote character, single or double, the first capture group matched. Since it's part of the regular expression, you need the backslash in front of the number and not a dollar sign.
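A small sketch of the backreference at work, on made-up input lines. It uses a lazy (.*?) between the quotes, because inside a bracketed character class \1 is not a backreference (Perl reads it as an octal character escape), so a class cannot be used to say "anything but the opening quote".

```perl
use strict;
use warnings;

# Hypothetical inputs: single quotes, double quotes with a second
# attribute, and a double-quoted URL that contains a single quote.
my @lines = (
    q{<a href='event.php?id=42'>Spring Concert</a>},
    q{<a href="event.php?id=43" title="Fair">Summer Fair</a>},
    q{<a href="it's-on.php?id=44">Autumn Market</a>},
);

my @links;
for my $line (@lines) {
    # (['"]) captures the opening quote; \1 then demands the same
    # character as the closing quote, and (.*?) stops at its first match.
    if ( $line =~ /href\s*=\s*(['"])(.*?)\1/ ) {
        push @links, $2;
    }
}

print "$_\n" for @links;
```

The third line shows why the closing quote has to match the opening one: the single quote inside the URL does not end the value.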

David W.