I am trying to write a regex that identifies relative src paths using PHP. My idea was to use a lookahead (?= followed by a negation ^ and a subexpression (http), but this doesn't work. The ^ works for a single character but not for a subexpression. Is there an && operator or something?

 <img.*?src=[\'\"]\(?=^(http))

I need it to test against the entire http, or else imgs whose paths start with h, t or p will be prejudiced against. Any suggestions? Is this task too big for regex?

mario
joel

2 Answers

2

You can use negative lookahead, which is (?!...) instead of (?=...). For your example (I'd put the anchor at the start):

^(?!http)

Which reads: start of string, not immediately followed by "http".
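As a rough sketch of how that anchor behaves in PHP, the snippet below runs it through preg_match over a few made-up paths. The sample values and variable names are assumptions for illustration, not taken from the question.

```php
<?php
// Hypothetical sample values, chosen to show which strings the lookahead rejects.
$paths = [
    'images/logo.png',
    'http://example.com/a.png',
    'https://example.com/b.png',
    'html/page.png',
];

$relative = [];
foreach ($paths as $path) {
    // ^(?!http) succeeds at the start of the string only when the
    // next four characters are not the literal sequence "http".
    if (preg_match('/^(?!http)/', $path) === 1) {
        $relative[] = $path;
    }
}

print_r($relative);
```

Note that 'html/page.png' is kept because the lookahead tests the whole sequence "http" rather than individual letters. A relative path such as 'httpdocs/a.png' would still be rejected, so a stricter variant like ^(?!https?://) may be a better fit.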

Edit: since you updated with a fuller example:

<img [^>]*src=['"](?!http)([^'"]+)['"]

                          ^------^ - this capturing group captures the link
                                     which doesn't start with http
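One possible way to apply this pattern in PHP is with preg_match_all, as sketched below. The HTML fragment and variable names are made up for illustration, and ~ is used as the delimiter only to avoid escaping slashes.

```php
<?php
// Hypothetical HTML fragment for illustration.
$html = <<<'HTML'
<img class="a" src="images/logo.png">
<img src='http://example.com/banner.png' alt="x">
<img alt="y" src="/static/photo.jpg">
HTML;

// The same pattern as above; only the single quotes are escaped
// because the pattern sits inside a single-quoted PHP string.
$pattern = '~<img [^>]*src=[\'"](?!http)([^\'"]+)[\'"]~';

preg_match_all($pattern, $html, $matches);

// Group 1 holds the src values that do not start with "http".
$relativeSrcs = $matches[1];
print_r($relativeSrcs);
```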

Of course, for proper parsing you should use DOM ;)

porges
0

It's not the most useful answer, but it sounds as though you've reached the limit of applicability for regex in HTML parsing.

As per this answer, look at using an HTML DOM parser. I haven't used PHP DOM parsers much, but I know that in other languages a DOM parser often makes HTML tasks a 30-second job, rather than an hour or more of testing weird exceptional cases.
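For what it's worth, a minimal sketch of the DOM route using PHP's bundled DOMDocument class might look like the following. The HTML fragment is made up, and the check for "http://" or "https://" is one assumed definition of an absolute URL.

```php
<?php
// Hypothetical HTML fragment for illustration.
$html = '<div><img src="images/logo.png">'
      . '<img src="http://example.com/banner.png">'
      . '<img alt="no src"></div>';

// Collect libxml warnings internally instead of emitting them,
// since real-world markup is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$relativeSrcs = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    // getAttribute returns an empty string when the attribute is missing.
    $src = $img->getAttribute('src');
    if ($src !== '' && preg_match('~^https?://~i', $src) !== 1) {
        $relativeSrcs[] = $src;
    }
}

print_r($relativeSrcs);
```

This way the parser handles attribute order and quoting, and the regex only has to answer the URL question.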

Matt Mitchell
  • I tend to jump on the "Don't parse *ML with regex" bandwagon as well, but in this case, this question is really independent of HTML parsing. It's actually a question of URL parsing. Even if joel uses a proper parser to extract the URL, he still has the same basic problem. – Frank Farmer May 05 '11 at 02:42
  • @Frank Farmer - Yep, you're right, although if you had a parser to grab the value of the SRC attribute, couldn't you just do a `StartsWith("http://")` equivalent in PHP? – Matt Mitchell May 05 '11 at 02:44
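A StartsWith-style check along those lines could be sketched in PHP as below. The helper names startsWith and isAbsoluteHttpUrl are hypothetical; on PHP 8 and later, the built-in str_starts_with could replace the strncmp helper.

```php
<?php
// Hypothetical helper: true when $value begins with $prefix.
// strncmp compares only the first strlen($prefix) bytes.
function startsWith(string $value, string $prefix): bool
{
    return strncmp($value, $prefix, strlen($prefix)) === 0;
}

// Hypothetical helper: treats only http:// and https:// URLs as absolute.
function isAbsoluteHttpUrl(string $src): bool
{
    return startsWith($src, 'http://') || startsWith($src, 'https://');
}

var_dump(isAbsoluteHttpUrl('http://example.com/a.png'));
var_dump(isAbsoluteHttpUrl('images/a.png'));
```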