3

I'm trying to write a regular expression that surrounds "http" URLs with angle brackets, except for lines beginning with two slashes. The best I've come up with is:

s#^(?!//)(.*?)(http://[^\s]+)#$1<$2>#gm;

This works great for these two:


Input: http://a.com

Output: <http://a.com>


Input: //http://a.com

Output: //http://a.com


However, it fails here:


Input: http://a.com http://b.com

Actual Output: <http://a.com> http://b.com

Desired Output: <http://a.com> <http://b.com>


Why doesn't my regular expression keep matching? Am I using /g wrong?

Chad Birch
mike

3 Answers

4

You should really use two regexes: one to identify the "commented-out" lines and one to wrap the URLs in the remaining lines.

There might be a non-standard way to combine the two regexes, or to replace every one of the repeated (http...)+ matches, but I wouldn't use it.
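
As a rough sketch of the two-regex idea, assuming the text can be split into lines (the helper name wrap_urls_by_line is made up for illustration and is not part of the original code):

```perl
use strict;
use warnings;

# Hypothetical helper: wrap http:// URLs in <...>, skipping lines that start with "//".
sub wrap_urls_by_line {
    my ($text) = @_;
    # The -1 limit keeps trailing empty fields, so a final newline survives the round trip.
    my @lines = split /\n/, $text, -1;
    for my $line (@lines) {
        next if $line =~ m{^//};            # regex 1: leave commented-out lines alone
        $line =~ s{(http://\S+)}{<$1>}g;    # regex 2: wrap every URL on the line
    }
    return join "\n", @lines;
}

print wrap_urls_by_line("http://a.com http://b.com\n//http://c.com\n");
```

Keeping the "is this line commented out?" test separate from the substitution means neither regex needs a lookahead or an anchor trick.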

aib
  • The regex is fed into a legacy function that operates on a big, multi-line blob of text. I wish I could split it into lines and do what you say, but that would require major regression testing. – mike Feb 09 '09 at 20:13
  • Major refactoring and regression testing, I should say. – mike Feb 09 '09 at 20:14
  • @Mike - if you need to match the beginning of multiple lines, consider the 'm' modifier. It causes ^ and $ to match the beginning or end of any line. – Chris Lutz Feb 09 '09 at 20:19
  • Oh, in practice I do -- somehow that got wiped out when I was turning it into an SO question. – mike Feb 09 '09 at 20:21
3

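You can't really do this for an indefinite number of URLs with a single pattern anchored at ^: once the first match has consumed the start of the line, ^ can't match again on that line. Try this: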

s#(http://[^\s]+)#<$1>#g unless m#^//#;

This will replace all of the URLs in the line, but only if the first two characters of the line aren't "//". Sure, it's a little more complicated, but it works (I think).

EDIT: My answer is the same as aib's, but I have code.
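
If the text arrives as one multi-line blob rather than line by line, one possible variant (an assumption on my part, not something tested against the legacy function mentioned in the comments) is a single s///e whose replacement runs the inner substitution on each non-commented line. The name wrap_urls_in_blob is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: one s///e over the whole blob.
sub wrap_urls_in_blob {
    my ($text) = @_;
    # With /m, ^ matches at the start of every line; .* stops at the newline.
    $text =~ s{^(?!//)(.*)}{
        my $line = $1;                      # copy first: the inner s/// resets $1
        $line =~ s{(http://\S+)}{<$1>}g;    # wrap every URL on this line
        $line;                              # value of the block becomes the replacement
    }gme;
    return $text;
}

print wrap_urls_in_blob("http://a.com http://b.com\n//http://c.com\n");
```

The outer pattern only decides which lines are eligible; the inner one does the wrapping, so the number of URLs per line no longer matters.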

Chris Lutz
3

Rewriting it a little, with my suggestions, and using the /x (extended whitespace) modifier so it's actually readable. :)

s{
    (?:^|\G)     # start of a line, or where the previous match ended (non-capturing)
    (?!//)       # not followed by //
    (.*?)        # followed by anything
    (
        http://  # with http://
        [^\s]+   # and non-spaces - you could also use \S
    )
 }
 {$1<$2>}xmg;

Trying this in Perl, we get:

sub test {
    my ($str, $expect) = @_;
    my $mod = $str;
    $mod =~ s{
            (?:^|\G)     # start of a line, or where the previous match ended
            (?!//)       # not followed by //
            (.*?)        # followed by anything
            (
                http://  # with http://
                [^\s]+   # and non-spaces - you could also use \S
            )
          }
          {$1<$2>}xmg;
    print "Expecting '$expect' got '$mod' - ";
    print $mod eq $expect ? "passed\n" : "failed\n";
}

test("http://foo.com",    "<http://foo.com>");
test("// http://foo.com", "// http://foo.com");
test("foo\nhttp://a.com","foo\n<http://a.com>");

# output is 
# Expecting '<http://foo.com>' got '<http://foo.com>' - passed
# Expecting '// http://foo.com' got '// http://foo.com' - passed
# Expecting 'foo
# <http://a.com>' got 'foo
# <http://a.com>' - passed

Edit: A couple of changes: added the 'm' modifier so that ^ matches at the start of every line, and changed \G to (?:^|\G) so that a match can also begin at the start of a line.
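
The tests above don't cover the two-URL line from the question, so here is a small side-by-side sketch of the ^-only pattern and the (?:^|\G) pattern on that input. The sub names wrap_anchored and wrap_continued are made up for this illustration:

```perl
use strict;
use warnings;

# The pattern from the question: anchored with ^ only.
sub wrap_anchored {
    my ($text) = @_;
    $text =~ s{^(?!//)(.*?)(http://\S+)}{$1<$2>}gm;
    return $text;
}

# The pattern from this answer: ^ or \G (the end of the previous match).
sub wrap_continued {
    my ($text) = @_;
    $text =~ s{(?:^|\G)(?!//)(.*?)(http://\S+)}{$1<$2>}gm;
    return $text;
}

my $blob = "http://a.com http://b.com\n//http://c.com\n";
print wrap_anchored($blob);   # only the first URL on line 1 is wrapped
print wrap_continued($blob);  # both URLs on line 1 are wrapped; line 2 is untouched
```

With ^ alone, the second attempt starts in the middle of the line, where ^ cannot match. \G lets the next attempt pick up exactly where the previous one stopped, and because . does not cross a newline, a match can never run on into a commented-out line.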

Robert P
  • That's really really good, and I might be able to figure out the last little problem on my own, but of course any input is appreciated: In practice it also has a /m modifier, since it operates on a big blob of text. This causes it to fail on "foo\nhttp://a.com" – mike Feb 09 '09 at 20:25
  • ...which should return "foo\n<http://a.com>" but actually returns "foo\nhttp://a.com" – mike Feb 09 '09 at 20:25
  • In fact, I'm going to accept your answer anyway, since it's perfect for the question as originally asked. – mike Feb 09 '09 at 20:26
  • Hey, changing your \G to (^|\G) and your $1<$2> to $2<$3> seems to work! – mike Feb 09 '09 at 20:28
  • ah yeah :) Figured that one out right as I was updating the question... give me a bit and I'll add it. :) – Robert P Feb 09 '09 at 20:34
  • Also, I made the first group a non-capturing group. That way it's clear to others that you really don't care what the first part is. – Robert P Feb 09 '09 at 20:39