How do I write a regex that performs multiple substitutions on each line, EXCEPT when the line starts with a certain string?

Question

I'm trying to write a regular expression that surrounds "http" URLs with angle brackets, except for lines beginning with two slashes. The best I've come up with is:

s#^(?!//)(.*?)(http://[^\s]+)#$1<$2>#gm;

This works great for these two:

Input: http://a.com

Output: <http://a.com>

Input: //http://a.com

Output: //http://a.com

However, it fails here:

Input: http://a.com http://b.com

Actual Output: <http://a.com> http://b.com

Desired Output: <http://a.com> <http://b.com>

Why doesn't my regular expression keep matching? Am I using /g wrong?

score 4 · Answer 1 · answered Feb 09 '09 at 20:10


You should really use two regexes: one to identify the "commented-out" lines and one to modify the URLs in the regular lines.

There might be a non-standard way to combine the two regexes, or to replace all of your multiple (http...)+ matches in one go, but I wouldn't use it.
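
If the text can't be split into lines first, the two-regex idea can still be applied to the whole blob in one pass. This is only a sketch, not something from the original answer: the helper names wrap_line and bracket_urls are made up for illustration, and it assumes the /e modifier is an option wherever the substitution actually runs. The outer substitution picks out each non-empty line that doesn't start with //, and the inner one brackets the URLs on that line.

```perl
use strict;
use warnings;

# Hypothetical helper: bracket every http:// URL in one line of text.
sub wrap_line {
    my ($line) = @_;                 # copy, so the caller's $1 is not touched
    $line =~ s{(http://\S+)}{<$1>}g;
    return $line;
}

# Hypothetical helper: apply wrap_line to every line not starting with "//".
sub bracket_urls {
    my ($text) = @_;
    # /m lets ^ match at each line start, /e runs the replacement as code.
    $text =~ s{^(?!//)(.+)}{ wrap_line($1) }gme;
    return $text;
}

print bracket_urls("http://a.com http://b.com\n//http://c.com\n");
# prints:
# <http://a.com> <http://b.com>
# //http://c.com
```

Keeping the "is this line commented out?" test separate from the "bracket the URLs" step means each regex stays simple enough to check on its own.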

answered Feb 09 '09 at 20:10

aib


The regex is fed into a legacy function that operates on a big, multi-line blob of text. I wish I could split it into lines and do what you say, but that would require major regression testing. – mike Feb 09 '09 at 20:13
Major refactoring and regression testing, I should say. – mike Feb 09 '09 at 20:14
@Mike - if you need to match the beginning of multiple lines, consider the 'm' modifier. It causes ^ and $ to match the beginning or end of any line. – Chris Lutz Feb 09 '09 at 20:19
Oh, in practice I do -- somehow that got wiped out when I was turning it into an SO question. – mike Feb 09 '09 at 20:21

score 3 · Answer 2 · answered Feb 09 '09 at 20:10

You can't really do this for an indefinite number of expressions. Try this:

s#(http://[^\s]+)#<$1>#g unless m#^//#;

This will replace all of the URLs in the line held in $_, but only if the first two characters of that line aren't "//". Sure, it's a statement with a condition rather than a single regex, but it works (I think).

EDIT: My answer is the same as aib's, but I have code.
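
The one-liner above works on a single line in $_. Here is a rough sketch of how it might be applied to a multi-line string, assuming you are free to split the text yourself (the asker says that isn't easy in his case). The function name bracket_by_line is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: run the one-liner on each line of a multi-line string.
sub bracket_by_line {
    my ($text) = @_;
    my @lines = split /^/m, $text;   # split at line starts, keeping the newlines
    for (@lines) {                   # $_ is aliased to each line in turn
        s#(http://\S+)#<$1>#g unless m#^//#;
    }
    return join '', @lines;
}

print bracket_by_line("http://a.com http://b.com\n//http://c.com\n");
# prints:
# <http://a.com> <http://b.com>
# //http://c.com
```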

Robert P · Accepted Answer · 2009-02-09T20:38:01.643

3

Rewriting it a little, with my suggestions and using the /x (extended whitespace) modifier so it's actually readable. :)

s{
    (?:^|\G)     # start of a line, or where the previous match ended
    (?!//)       # not immediately followed by //
    (.*?)        # followed by anything
    (
        http://  # with http://
        [^\s]+   # and non-spaces - you could also use \S
    )
 }
 {$1<$2>}xmg;

Trying this in perl, we get:

sub test {
    my ($str, $expect) = @_;
    my $mod = $str;
    $mod =~ s{
            (?:^|\G)     # start of a line, or where the previous match ended
            (?!//)       # not immediately followed by //
            (.*?)        # followed by anything
            (
                http://  # with http://
                [^\s]+   # and non-spaces - you could also use \S
            )
          }
          {$1<$2>}xmg;
    print "Expecting '$expect' got '$mod' - ";
    print $mod eq $expect ? "passed\n" : "failed\n";
}

test("http://foo.com",    "<http://foo.com>");
test("// http://foo.com", "// http://foo.com");
test("foo\nhttp://a.com","foo\n<http://a.com>");

# output is 
# Expecting '<http://foo.com>' got '<http://foo.com>' - passed
# Expecting '// http://foo.com' got '// http://foo.com' - passed
# Expecting 'foo
# <http://a.com>' got 'foo
# <http://a.com>' - passed

Edit: A couple of changes: added the 'm' modifier so that ^ matches at the start of every line, and changed \G to (?:^|\G) so that a match can begin either at the start of a line or where the previous match ended.
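
The tests above don't include the two-URL input from the question, so here is a small sketch that runs the question's pattern and the \G version side by side on a multi-line blob. The function names and the sample text are made up for illustration. With only ^, every match has to begin at a line start, so /g has nowhere to resume after the first URL on a line. With \G, the next match may begin where the last one ended.

```perl
use strict;
use warnings;

# The pattern from the question: every match has to begin at a line start.
sub wrap_line_anchored {
    my ($text) = @_;
    $text =~ s#^(?!//)(.*?)(http://[^\s]+)#$1<$2>#gm;
    return $text;
}

# The pattern from this answer: a match may also begin where the last one ended.
sub wrap_resuming {
    my ($text) = @_;
    $text =~ s{
        (?:^|\G)       # start of a line, or where the previous match ended
        (?!//)         # not immediately followed by //
        (.*?)          # anything up to the next URL on the same line
        (http://\S+)   # the URL itself
    }{$1<$2>}xmg;
    return $text;
}

# Made-up sample input: two URLs on one line, a commented line, a plain line.
my $blob = "http://a.com http://b.com\n"
         . "//http://c.com http://d.com\n"
         . "see http://e.com\n";

print wrap_line_anchored($blob);
print "----\n";
print wrap_resuming($blob);
```

The first call should leave http://b.com unbracketed, as in the question. The second should bracket both URLs on the first line and still leave the // line alone, since . doesn't match a newline and so a match can't run on from one line into the next.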

edited Feb 09 '09 at 20:38

answered Feb 09 '09 at 20:19

Robert P


That's really really good, and I might be able to figure out the last little problem on my own, but of course any input is appreciated: In practice it also has a /m modifier, since it operates on a big blob of text. This causes it to fail on "foo\nhttp://a.com" – mike Feb 09 '09 at 20:25
...which should return "foo\n<http://a.com>" but actually returns "foo\nhttp://a.com" – mike Feb 09 '09 at 20:25
In fact, I'm going to accept your answer anyway, since it's perfect for the question as originally asked. – mike Feb 09 '09 at 20:26
Hey, changing your \G to (^|\G) and your $1<$2> to $2<$3> seems to work! – mike Feb 09 '09 at 20:28
ah yeah :) Figured that one out right as I was updating the question... give me a bit and I'll add it. :) – Robert P Feb 09 '09 at 20:34
Also, I made the first group a non-capturing group. That way it's clear to others that you really don't care what the first part is. – Robert P Feb 09 '09 at 20:39
