Multi-Line Regular Expression

Question

I'm trying to extract two pieces of information from this text:

<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
                                      </div>
                <p class="small">

                                                    Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm

I'd like to get the text after /blogs/ (e.g. "bad-business-writing-487") and also the "Added by" string, i.e. the student name and submission date (e.g. "Kemberley Ramirez on September 2, 2010 at 11:38pm").

I'm using UltraEdit with Perl expressions.

[Friends don't let friends parse HTML with regular expressions.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Ether, Sep 03 '10 at 14:44
I didn't ask if I should; I asked HOW TO. And it's perfectly feasible in this situation with the fact that the tags are routinely in the same place to parse it with REGEX. — Caveatrob, Sep 04 '10 at 07:54

score 3 · Accepted Answer · answered Sep 03 '10 at 15:51

I don't know what exactly you are trying to match, but you are better off using a proper HTML parser:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

# Dots in the host name are escaped so they match literally, not "any character".
my $blog_re = qr{^http://english317\.ning\.com/profiles/blogs/(.+)\z};
my $profile_re = qr{^/profile/(\w+)\z};

while ( my $tag = $parser->get_tag('a') ) {
    next unless my ($href) = $tag->get_attr('href');
    if ( $href =~ $blog_re or $href =~ $profile_re ) {
        print "[$1]\n";
    }
}

__DATA__
<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
                                      </div>
                <p class="small">

                                                    Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm
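
For readers who are not working in Perl, the same parser-based idea can be sketched in Python with the standard library's html.parser module. This is only a sketch under assumptions: the names BlogEntryParser, extract_links and SAMPLE are hypothetical, and, like the Perl script above, it only collects the two href values and not the date text that follows the profile link.

```python
import re
from html.parser import HTMLParser

# Same two patterns as the Perl script; fullmatch() below anchors them at both ends.
BLOG_RE = re.compile(r"http://english317\.ning\.com/profiles/blogs/(.+)")
PROFILE_RE = re.compile(r"/profile/(\w+)")

# Sample input copied from the question (indentation shortened).
SAMPLE = '''<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
    </div>
    <p class="small">

        Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm
'''


class BlogEntryParser(HTMLParser):
    """Hypothetical helper: collects the captured part of matching <a href> values."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        for pattern in (BLOG_RE, PROFILE_RE):
            m = pattern.fullmatch(href)
            if m:
                self.matches.append(m.group(1))
                break


def extract_links(html):
    """Feed the HTML to the parser and return the captured values in document order."""
    parser = BlogEntryParser()
    parser.feed(html)
    parser.close()
    return parser.matches


if __name__ == "__main__":
    for value in extract_links(SAMPLE):
        print(f"[{value}]")
```

Capturing the date as well would mean overriding handle_data to record the text that follows the profile link, which this sketch leaves out.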

score 0 · Answer 2 · answered Sep 05 '10 at 06:46

Using PowerGrep in "dot matches newline" mode, I came up with this:

(?>profiles/blogs/(.*?)").*?added by(.*?)</a>(.*?2010.*?\d{2}[ap]m)

(followed by a second search to strip the leftover anchor tag from the name) <?a.*?>
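
To try this pattern outside PowerGrep, here is a rough Python sketch. It makes several assumptions: re.DOTALL stands in for "dot matches newline" mode, re.IGNORECASE is added because the pattern says "added by" while the page says "Added by", and the atomic group (?>...) is dropped because Python only supports that syntax from version 3.11. The names extract_entry, ENTRY_RE, ANCHOR_TAG_RE and SAMPLE are hypothetical, and the tag-stripping pattern is a slightly tightened variant of the one above.

```python
import re

# The pattern from this answer, minus the atomic group (an assumption, see above).
ENTRY_RE = re.compile(
    r'profiles/blogs/(.*?)".*?added by(.*?)</a>(.*?2010.*?\d{2}[ap]m)',
    re.DOTALL | re.IGNORECASE,
)

# Second pass: remove the opening <a ...> tag left in front of the name.
ANCHOR_TAG_RE = re.compile(r"<a\b[^>]*>")

# Sample input copied from the question (indentation shortened).
SAMPLE = '''<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
    </div>
    <p class="small">

        Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm
'''


def extract_entry(html):
    """Return (slug, author, added_on) for the first entry, or None if nothing matches."""
    m = ENTRY_RE.search(html)
    if m is None:
        return None
    slug = m.group(1)
    author = ANCHOR_TAG_RE.sub("", m.group(2)).strip()
    added_on = m.group(3).strip()
    return slug, author, added_on


if __name__ == "__main__":
    print(extract_entry(SAMPLE))
```

Because the year 2010 is hard-coded in the pattern, entries from other years would not match without adjusting it.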

answered by Caveatrob

score -1 · Answer 3 · answered Sep 03 '10 at 09:18

The /s and /m modifiers control how multiple lines are handled: /s lets . match newlines, and /m lets ^ and $ match at the start and end of every line. See perlretut.

You probably want something like rrr's regexps with the /s modifier, or something like this (untested):

$foo =~ m|blogs/([^"]+).*?Added by <[^>]+>([^<]+)</a>|s;

Using m|| instead of // avoids having to escape the slashes.
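
The difference between the two modifiers can be checked with a small Python sketch, assuming re.DOTALL and re.MULTILINE as the counterparts of Perl's /s and /m. The variable names and the shortened sample text are illustrative only.

```python
import re

# Sample input copied from the question (indentation shortened).
TEXT = (
    '<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>\n'
    '    </div>\n'
    '    <p class="small">\n'
    '\n'
    '        Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a>'
    ' on September 2, 2010 at 11:38pm\n'
)

PATTERN = r'blogs/([^"]+).*?Added by <[^>]+>([^<]+)</a>'

# Without DOTALL, "." stops at a newline, so the match cannot span lines.
no_flag = re.search(PATTERN, TEXT)

# With DOTALL (Perl's /s), "." also matches newlines.
with_dotall = re.search(PATTERN, TEXT, re.DOTALL)

# MULTILINE (Perl's /m) only affects the anchors: ^ matches at every line start.
LINE_START = r'^\s*Added by'
anchored_plain = re.search(LINE_START, TEXT)
anchored_multiline = re.search(LINE_START, TEXT, re.MULTILINE)

if __name__ == "__main__":
    print(no_flag)                         # no match across lines
    print(with_dotall.groups())            # slug and author name
    print(anchored_plain is None)          # ^ only matches at the string start
    print(anchored_multiline is not None)  # ^ matches at a line start
```

For this question it is the /s behaviour that matters, since the blog link and the "Added by" text sit on different lines.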

score -2 · Answer 4 · answered Sep 03 '10 at 10:19

The following should work across multiple lines:

.*blogs\/(\S+)".*(?:\n.*)*<a.*>(.*)<\/a>(.*)

answered by Divya Saxena
