-1

I'm trying to extract matches from this text:

<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
                                      </div>
                <p class="small">

                                                    Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm   

I'd like to capture the text after /blogs/ (e.g. "bad-business-writing-487") and the "Added by" string, i.e. the student name and submission date (e.g. "Kemberley Ramirez on September 2, 2010 at 11:38pm").

I'm using UltraEdit with Perl-style regular expressions.

Caveatrob
  • You might find this site useful: regexlib.com/ – vlood Sep 03 '10 at 08:17
  • [Friends don't let friends parse HTML with regular expressions.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Ether Sep 03 '10 at 14:44
  • I didn't ask whether I should; I asked HOW TO. And since the tags are consistently in the same place, parsing this with a regex is perfectly feasible in this situation. – Caveatrob Sep 04 '10 at 07:54

4 Answers

3

I don't know what exactly you are trying to match, but you are better off using a proper HTML parser:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

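# Dots in the host name are escaped so they match literally;
# the capture group holds the slug after /blogs/.
my $blog_re = qr{^http://english317\.ning\.com/profiles/blogs/(.+)\z};
# Captures the user name from a relative profile link.
my $profile_re = qr{^/profile/(\w+)\z};

# Walk every <a> tag and print the captured part of any matching href.
while ( my $tag = $parser->get_tag('a') ) {
    next unless my ($href) = $tag->get_attr('href');
    if ( $href =~ $blog_re or $href =~ $profile_re ) {
        print "[$1]\n";
    }
}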

__DATA__
<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
                                      </div>
                <p class="small">

                                                    Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm
Sinan Ünür
0

Using PowerGrep in "dot matches newline" mode, I came up with this:

(?>profiles/blogs/(.*?)").*?added by(.*?)</a>(.*?2010.*?\d{2}[ap]m)

(and then an extra post-processing search):

<?a.*?>

Caveatrob
-1

The /s and /m modifiers control how multi-line strings are handled; see perlretut.
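As a minimal sketch of the difference (the sample string and variable names here are made up for illustration): /s lets . match a newline, while /m lets ^ and $ match at the start and end of every line.

```perl
use strict;
use warnings;

# Hypothetical two-line sample, not taken from the question.
my $text = "first line\nsecond line\n";

# Without /s, "." stops at a newline, so this cannot span both lines.
my $plain = $text =~ /first.*second/ ? 1 : 0;

# With /s, "." also matches "\n", so the match can cross the line break.
my $dotall = $text =~ /first.*second/s ? 1 : 0;

# Without /m, "^" only matches at the very start of the string.
my $anchored = $text =~ /^second/ ? 1 : 0;

# With /m, "^" also matches right after each embedded newline.
my $multiline = $text =~ /^second/m ? 1 : 0;

print "plain=$plain dotall=$dotall anchored=$anchored multiline=$multiline\n";
```

This prints plain=0 dotall=1 anchored=0 multiline=1. For a pattern that has to run from the blog link on one line to the "Added by" text a few lines later, /s is the modifier that matters.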

You probably want something like rrr regexps with the /s modifier, or something like this (untested):

$foo =~ m|blogs/([^"]+).*Added by <[^>]+>([^<]+)</a>|s

Using m|| instead of // avoids having to escape all the slashes.
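Here is a hedged sketch of how that pattern might be extended to pick up the date as well. The variable names, the lazy .*? and the third capture group are my own assumptions, and it has only been checked against the sample from the question (with the indentation shortened).

```perl
use strict;
use warnings;

# Sample copied from the question, indentation shortened.
my $html = <<'END';
<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
    </div>
    <p class="small">

        Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm
END

# /s lets "." cross newlines; m|...| avoids escaping the slashes.
# Group 1: slug after blogs/, group 2: link text (the name),
# group 3: what follows "on" up to the end of the line (the date).
my ($slug, $name, $date) =
    $html =~ m|blogs/([^"]+).*?Added by <[^>]+>([^<]+)</a>\s*on\s+([^\n<]+)|s;

if ( defined $slug ) {
    $date =~ s/\s+\z//;    # trim trailing whitespace, if any
    print "$slug\n";
    print "$name on $date\n";
}
```

The lazy .*? is a deliberate choice: on a page with several entries, a greedy .* would skip ahead to the last "Added by" instead of the one that belongs to the current blog link.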

Øyvind Skaar
-2

The following should work across multiple lines:

.*blogs\/(\S+)".*(?:\n.*)*<a.*>(.*)<\/a>(.*)