
I have the following regex rule:

'/((f|ht)tp)(.*?)(.gif|.png|.jpg|.jpeg)/'

It works great, but I don't want it to match anything that is preceded by a newline and 4 or more spaces; that is, something like this:

"\n    "

How can I do this?

Alan Moore
Frantisek
  • Yeah, like those ``code`` parts here on Stack, where if you start a line with 4 or more spaces, it becomes a code block and isn't processed. – Frantisek Feb 20 '13 at 00:12
  • Negative assertions `(?<!...)`. * See also [Open source RegexBuddy alternatives](http://stackoverflow.com/questions/89718/is-there) and [Online regex testing](http://stackoverflow.com/questions/32282/regex-testing) for some helpful tools, or [RegExp.info](http://regular-expressions.info/) for a nicer tutorial. – mario Feb 20 '13 at 00:15
  • @mario Note that assertions with `<` in them (`(?<=...)` and `(?<!...)`) are lookbehind assertions and probably not really what is needed here. Here lookahead assertions `(?=...)` and `(?!...)` are more appropriate (specifically the negative lookahead `(?!...)` in this case). – Mike Brant Feb 20 '13 at 00:20
  • Negative lookbehind would be best here if regex in PHP supported variable length lookbehinds. The regex would be `/(?<!\n {4,}).../` where `...` is the existing regex. – Andrew Clark Feb 20 '13 at 00:43

2 Answers


I have added a negative lookahead anchored at the beginning of the line. It checks for the existence of a newline character followed by 4 or more whitespace characters. If this condition exists the match will fail.

'/^(?!\n\s{4,}).*((f|ht)tp)(.*?)(.gif|.png|.jpg|.jpeg)/'
Mike Brant
  • Have you tried it? It doesn't seem to work (the match passes), but I might be doing something wrong, although I'd say I'm not. – Frantisek Feb 20 '13 at 00:20
  • @RichardRodriguez Yeah just did a little testing and it seems I am able to get it to work when moving the line start anchor outside of the lookahead (not sure why it didn't work inside). Take a look at the revised answer. – Mike Brant Feb 20 '13 at 00:33

You don't need to include the linefeed itself in the lookahead, just use the start anchor (^) in multiline mode. Also, since \s can match all kinds of whitespace including linefeeds and tabs, you're better off using a literal space character:

'/^(?! {4}).*(f|ht)tp(.*?)(.gif|.png|.jpg|.jpeg)/m'
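
As a rough illustration of the multiline anchor plus negative lookahead, here is a minimal sketch using Python's `re` module as a stand-in for PCRE (`re.MULTILINE` plays the role of the `/m` flag). The names `IMAGE_URL_LINE` and `matching_lines` and the sample URLs are made up for the example, and the dots before the extensions are escaped here, which the pattern above does not do.

```python
import re

# Hypothetical name. Same idea as the pattern above: ^ in multiline mode,
# then a negative lookahead that rejects lines starting with four spaces.
# Assumption: the extension dots are escaped so they match a literal ".".
IMAGE_URL_LINE = re.compile(
    r"^(?! {4}).*(?:f|ht)tp(.*?)(\.gif|\.png|\.jpg|\.jpeg)",
    re.MULTILINE,
)


def matching_lines(text):
    """Return the full match for each line not indented by 4+ spaces."""
    return [m.group(0) for m in IMAGE_URL_LINE.finditer(text)]


sample = (
    "see http://example.com/a.png\n"
    "    http://example.com/b.png\n"
    "and ftp://example.com/c.jpg here\n"
)
print(matching_lines(sample))
```

In this sketch the indented second line is skipped, and each match still includes any text that precedes the URL on its line, because `.*` consumes it.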

Speaking of tabs, they can be used in place of the four spaces to create code blocks here on SO, so you might want to allow for that as well:

'/^(?! {4}|\t).*(f|ht)tp(.*?)(.gif|.png|.jpg|.jpeg)/m'
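
To check which kinds of indentation the alternation inside the lookahead rejects, a small sketch along these lines may help. It again uses Python's `re` in place of PCRE; `NOT_CODE_LINE` and `is_linkable` are hypothetical names, and the escaped dots are an assumption of the sketch rather than part of the pattern above.

```python
import re

# Hypothetical name. The lookahead now fails on four spaces OR a single tab.
NOT_CODE_LINE = re.compile(
    r"^(?! {4}|\t).*(?:f|ht)tp(.*?)(\.gif|\.png|\.jpg|\.jpeg)",
    re.MULTILINE,
)


def is_linkable(line):
    """True if the line holds an image URL and is not code-indented."""
    return NOT_CODE_LINE.search(line) is not None


# Three spaces are not enough to count as a code block; four spaces
# or one tab are.
for candidate in (
    "http://example.com/a.gif",
    "   http://example.com/a.gif",
    "    http://example.com/a.gif",
    "\thttp://example.com/a.gif",
):
    print(repr(candidate), is_linkable(candidate))
```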

Finally, if you want the regex to match (as in consume) only the URL, you can use the match-start-reset operator, \K. It acts like a positive lookbehind, without the fixed-length limitation:

'/^(?! {4}|\t).*?\K(f|ht)tp(.*?)(.gif|.png|.jpg|.jpeg)/m'
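
For engines without `\K`, a capturing group gives a similar result. Python's built-in `re` module is one such engine, so the sketch below wraps the URL in group 1 and reads that group instead of the whole match. This is a substitute for `\K`, not the same operator; `URL_ONLY` and `extract_urls` are hypothetical names, and the dots are escaped as an assumption of the sketch.

```python
import re

# Hypothetical name. The lazy .*? skips leading text, and group 1 holds
# only the URL, standing in for what \K would leave as the match.
URL_ONLY = re.compile(
    r"^(?! {4}|\t).*?((?:f|ht)tp.*?(?:\.gif|\.png|\.jpg|\.jpeg))",
    re.MULTILINE,
)


def extract_urls(text):
    """Return the URL from each line that is not code-indented."""
    return [m.group(1) for m in URL_ONLY.finditer(text)]


sample = (
    "logo: http://example.com/logo.png (new)\n"
    "\tftp://example.com/raw.jpg\n"
)
print(extract_urls(sample))
```

Because the pattern is anchored with `^`, this sketch finds at most one URL per line.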
Alan Moore