4

I want to match URLs that do NOT contain the string 'localhost' using a Ruby regex.

Based on answers and comments here, I put together two solutions, both of which seem to work:

Solution A:

(?!.*localhost)^.*$ 

Example: http://rubular.com/r/tQtbWacl3g

Solution B:

^((?!localhost).)*$ 

Example: http://rubular.com/r/2KKnQZUMwf

The problem is that I don't understand what they're doing. For example, according to the docs, ^ can be used in various ways:

[^abc]  Any single character except: a, b, or c  
^ Start of line  

But I don't get how it's being applied here.

Can someone break down these expressions for me, and explain how they differ from one another?

Yarin

4 Answers

5

In both of your cases, ^ is just the start of the line (since it's not used inside a character class). Since both ^ and the lookahead are zero-width assertions, we can switch them around in the first case - I think that makes it a bit easier to explain:

^(?!.*localhost).*$ 

The ^ anchors the expression to the beginning of the string (strictly speaking, in Ruby ^ matches at the beginning of each line, which is the same thing for a single-line URL). The lookahead then starts from that position and tries to find localhost anywhere in the string (the "anywhere" is taken care of by the .* in front of localhost). If localhost can be found, the subexpression of the lookahead matches, and therefore the negative lookahead causes the pattern to fail. Since the lookahead is bound to start at the beginning of the string by the adjacent ^, this means the pattern as a whole cannot match. If, however, .*localhost does not match (and hence localhost does not occur in the string), the lookahead succeeds, and the .*$ simply takes care of matching the rest of the string.
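
To see this behaviour in Ruby, here is a minimal sketch. The sample URLs and the names NO_LOCALHOST_A, urls and accepted_a are made up for illustration; only the pattern comes from the discussion above.

```ruby
# Solution A with the anchor written first, as in the explanation above.
NO_LOCALHOST_A = /^(?!.*localhost).*$/

# Hypothetical sample URLs, chosen only for illustration.
urls = [
  "http://example.com/index.html",
  "http://localhost:3000/users",
  "https://example.com/?next=localhost"
]

# Keep only the URLs the pattern matches, i.e. those without "localhost" anywhere.
accepted_a = urls.select { |url| NO_LOCALHOST_A.match?(url) }

puts accepted_a
# prints: http://example.com/index.html
```

Note that the third URL is rejected too: the .* inside the lookahead lets it find localhost at any position, not just in the host part.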

Now the other one

^((?!localhost).)*$

This time the lookahead only checks at the current position (there is no .* inside it). But the lookahead is repeated for every single character. This way it does check every single position again. Here is roughly what happens: the ^ makes sure that we're starting at the beginning of the string again. The lookahead checks whether the word localhost is found at that position. If not, all is well, and . consumes one character. The * then repeats both of those steps. We are now one character further in the string, and the lookahead checks whether the second character starts the word localhost - again, if not, all is well, and . consumes another character. This is done for every single character in the string, until we reach the end.
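
A small Ruby sketch of that character-by-character check (the sample strings and variable names are assumptions for illustration). It also shows a side effect of the capturing parentheses: because the group is repeated, it only keeps the last character it consumed.

```ruby
# Solution B: one lookahead check in front of every consumed character.
NO_LOCALHOST_B = /^((?!localhost).)*$/

ok  = NO_LOCALHOST_B.match("http://example.com")
bad = NO_LOCALHOST_B.match("http://localhost:3000")

puts ok[0]       # the whole string was consumed: http://example.com
puts ok[1]       # the repeated group holds only its last repetition: m
puts bad.inspect # nil, because the loop stops in front of "localhost" and $ cannot match there
```

If the captured character is not needed, writing the group as (?:(?!localhost).)* should behave the same without capturing.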

In this particular case both methods are equivalent, and you could select one based on performance (if it matters) or readability (if not; probably the first one). However, in other cases the second variant is preferable, because it allows you to do this repetition for a fixed part of the string, whereas the first variant will always check the entire string.
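
As a rough illustration of the "fixed part of the string" point, the sketch below applies the per-character lookahead only to the host portion of a URL, so localhost is still allowed in the path. The pattern, the https? prefix and the helper name host_without_localhost are hypothetical, not something from the question.

```ruby
# Hypothetical pattern: the (?!localhost) check is repeated only while
# consuming the host (everything up to the first "/"), not the whole string.
HOST_WITHOUT_LOCALHOST = %r{^https?://((?:(?!localhost)[^/])*)(?:/.*)?$}

# Returns the host if it does not contain "localhost", otherwise nil.
def host_without_localhost(url)
  m = HOST_WITHOUT_LOCALHOST.match(url)
  m && m[1]
end

puts host_without_localhost("http://example.com/localhost.html").inspect # "example.com"
puts host_without_localhost("http://localhost:3000/index.html").inspect  # nil
puts host_without_localhost("http://my.localhost/index.html").inspect    # nil
```

With the first variant, the .* inside the lookahead would also reject the first URL, since it scans all the way to the end of the line.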

Martin Ender
  • @m.buettner- Thanks for this excellent in-depth answer. However, not clear why `^` is required for the lookahead on the first example- why does `(?!.*localhost).*$` not work? Doesn't regex search from the beginning of the string by default? – Yarin Aug 18 '13 at 15:50
  • @Yarin yes, but say your string is `localhost:80`; then the regex would fail at the beginning of the string. But without the anchor, it's free to try again at a later position (just like how `/foo/` can find `foo` in `"barfoobar"`). So the engine makes a second attempt at the next position. Now that the `l` is left of the starting position, `localhost` cannot be found any more (there's only `ocalhost` left), and you would get an undesired match of `ocalhost:80`. – Martin Ender Aug 18 '13 at 15:52
  • @m.buettner- Got it, thanks- one more question though: How are `(?!.*localhost)^.*$` and `^(?!.*localhost).*$` equivalent? Having the `^` at the beginning makes sense to me now, but having it after the parenthesis is still confusing me. – Yarin Aug 18 '13 at 15:58
  • @Yarin lookaheads don't advance the position of the engine's "cursor". So after the lookahead is done you're still at the same position as before (that's how they **look** ahead). So whether you first check that you're at the beginning of the string and then that there's no `localhost`, or vice versa, is like saying `if(a && b)` vs. `if(b && a)` – Martin Ender Aug 18 '13 at 16:00
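
Both points from these comments can be checked with a short Ruby sketch (variable names and the second sample string are illustrative assumptions):

```ruby
# Without ^, the engine retries one position to the right and succeeds there.
unanchored = /(?!.*localhost).*$/
leaked = "localhost:80"[unanchored]
puts leaked.inspect # "ocalhost:80"

# With ^, the order of the anchor and the lookahead makes no difference,
# since both are zero-width checks made at the same position.
lookahead_first = /(?!.*localhost)^.*$/
anchor_first    = /^(?!.*localhost).*$/

samples = ["localhost:80", "example.com:80"]
same = samples.all? { |s| lookahead_first.match?(s) == anchor_first.match?(s) }
puts same # true
```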
3

You can get them easily explained online. The first:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    localhost                'localhost'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
--------------------------------------------------------------------------------
                           ' '

And the second:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (                        group and capture to \1 (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
      localhost                'localhost'
--------------------------------------------------------------------------------
    )                        end of look-ahead
--------------------------------------------------------------------------------
    .                        any character except \n
--------------------------------------------------------------------------------
  )*                       end of \1 (NOTE: because you are using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \1)
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
--------------------------------------------------------------------------------
Carl Norum
3

As an aside, these two solutions are slow. A better way is to use:

^(?:[^l]+|l(?!ocalhost))+

In other words: all characters that are not an l, or an l not followed by ocalhost.

This will give you a better result since you don't have to check each position. (For a URL like http://localhost:1234/toto this kind of pattern will fail in ~15 steps vs ~50 steps for the two other patterns.)

You can improve this pattern using atomic groups and possessive quantifiers to prevent backtracking:

^(?>[^l]++|l(?!ocalhost))++
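
One thing worth checking when trying this in Ruby: as written, the pattern has no end anchor, so it can succeed on just the prefix in front of localhost. The sketch below compares it with a variant that has a trailing $ added; that $ is an assumption about the intent and is not part of the answer above, and the constant names are made up.

```ruby
# Possessive pattern from the answer, unchanged (no end anchor).
PREFIX_ONLY = /^(?>[^l]++|l(?!ocalhost))++/
# Same pattern with a trailing $ added (assumption, not in the original answer).
WHOLE_LINE  = /^(?>[^l]++|l(?!ocalhost))++$/

url = "http://localhost:1234/toto"

prefix   = url[PREFIX_ONLY]        # text matched by the unanchored version
rejected = !WHOLE_LINE.match?(url) # the anchored version cannot reach the end of the line
accepted = WHOLE_LINE.match?("http://example.com/toto")

puts prefix.inspect # "http://"
puts rejected       # true
puts accepted       # true
```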

Note that in your particular case you can speed up your pattern considering that you only want to check the host part of the url. Example:

^http:\/\/(?>[^l\s\/]++|l(?!ocalhost))++(?>\/\S*+|$)
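
A quick way to try the host-only pattern in Ruby, as a sketch with made-up sample URLs (HOST_CHECK, samples and results are illustrative names). The pattern is the one above, written with %r{} so the slashes need no escaping.

```ruby
# Host-only check: the lookahead logic is applied up to the first "/" only.
HOST_CHECK = %r{^http://(?>[^l\s/]++|l(?!ocalhost))++(?>/\S*+|$)}

samples = [
  "http://example.com/toto",
  "http://localhost:1234/toto",
  "http://example.com/localhost"
]

results = samples.map { |url| [url, HOST_CHECK.match?(url)] }.to_h

results.each { |url, ok| puts "#{url} => #{ok}" }
# http://example.com/toto => true
# http://localhost:1234/toto => false
# http://example.com/localhost => true
```

The last URL is accepted because localhost only appears in the path, which this pattern deliberately does not inspect.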
Casimir et Hippolyte
  • Quite a bold claim ;) (although I can see that this should be true). However, especially for the first answer, I'd prefer if you supported that claim with actual benchmark results. I think some engine optimisation could well tip the scales in favour of the OPs first pattern (for failing strings). – Martin Ender Aug 18 '13 at 15:56
2

according to the docs, ^ can be used in various ways:

[^abc]  Any single character except: a, b, or c   
^ Start of line  

But I don't get how it's being applied here.

In the regex

(?!.*localhost)^.*$ 

The ^ is not inside any brackets, so the second one applies. Here is a trivial example:

/^x/

That regex says to match the start of the line, followed by the letter x. So it will match lines like this:

 xcellent
 x-ray

However, the regex will not match the lines:

 axb
 excellent

...because the x does not appear directly after the start of the line. You may wonder why 'axb' doesn't match. After all 'a' is the start of the line, and it is followed by an 'x'. However, 'start of the line' is just to the left of the first character, like this:

   |
   V
    axb

^ is called a zero-width match because it matches the slim sliver just to the left of the 'a', e.g. between the starting quote mark and the 'a' in "axb". There's not really any space there, so ^ matches something that is 0 width.
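
Here is how that looks in Ruby, using the four example lines from above (the variable names are just for illustration):

```ruby
words = %w[xcellent x-ray axb excellent]

# Only the words with an x directly after the start of the line match.
starting_with_x = words.select { |w| w.match?(/^x/) }
puts starting_with_x.inspect # ["xcellent", "x-ray"]

# ^ on its own matches at position 0 and consumes nothing.
m = "axb".match(/^/)
puts m.begin(0)   # 0
puts m[0].inspect # ""
```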

Here is another example:

/x^/

That says to match the character x followed by the start of the line. Well, no line can have an x first and then the start of the line second, so that won't ever match anything.
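
You can convince yourself of that with a few test strings in Ruby (the candidates are arbitrary examples):

```ruby
never = /x^/

# Even a string with an x on each of two lines does not match,
# because the position right after an x is never the start of a line.
candidates = ["x", "xx", "axb", "x\nx"]
any_match = candidates.any? { |s| never.match?(s) }

puts any_match # false
```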

Now your regex:

(?!.*localhost)^.*$

Like the 'start of line' ^, a lookahead is zero-width. What that means is that the lookahead scans ahead in the string looking for localhost, but whatever it finds, it consumes no characters: if localhost is not there, the negative lookahead succeeds, the engine is still at the beginning of the string, and it then looks for the rest of the regex:

^.*$

One word of advice: when a regex requires lookarounds (lookaheads or lookbehinds), 99% of the time there are easier ways to do what you want. For instance, you could write:

url = "....."

if url.start_with?('http')
   # then the line starts with 'http'
else
   # the line doesn't start with 'http'
end

That's much easier to read, and it doesn't require trying to decipher a complex regex.
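
Applied to the original question, the same idea could look like the sketch below. The helper name mentions_localhost? and the sample URLs are made up for illustration; String#include? is plain Ruby.

```ruby
# Plain string check instead of a regex with a lookahead.
def mentions_localhost?(url)
  url.include?("localhost")
end

urls = [
  "http://example.com/index.html",
  "http://localhost:3000/users"
]

without_localhost = urls.reject { |url| mentions_localhost?(url) }
puts without_localhost.inspect # ["http://example.com/index.html"]
```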

7stud
  • @Yarin, Hey I added some advice at the end. – 7stud Aug 18 '13 at 16:10
  • @7stud- Yea thanks- I realize using Ruby `else` logic is generally preferred, but this is for passing a list of regex match conditions to a 3rd party filter function, so we don't have that choice – Yarin Aug 18 '13 at 16:16
  • @Yarin, Also...when you are using rubular, it is often helpful to use capturing parentheses around parts of your regex to see what they match. For instance, if you use the regex: `((?!.*localhost))(^.*$)`, rubular will show the matches for group 1 and group 2. Note that the match for group1 is blank--that's because a lookahead is one of those 0 width things that doesn't really match any characters in the string--it just looks for them. Rubular could be improved--it could show whether a lookahead finds what it is looking for. – 7stud Aug 18 '13 at 16:16
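
The same experiment can be run in plain Ruby instead of Rubular; a small sketch with an arbitrary sample URL:

```ruby
# Group 1 wraps the lookahead, group 2 wraps the part that consumes text.
m = /((?!.*localhost))(^.*$)/.match("http://example.com")

puts m[1].inspect # "" (the lookahead matched, but consumed no characters)
puts m[2].inspect # "http://example.com"
```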