Parse links, except for links inside a src=""

Question

I have the following code, which replaces URLs with the corresponding links:

$in = array
(
        '/(?:^|\b)((((http|https|ftp):\/\/)|(www\.))([\w\.]+)([,:%#&\/?=\w+\.-]+))(?:\b|$)/is'
);
$out = array
(
        "<a href=\"$1\" target=\"_blank\">$1</a>"
);
return preg_replace($in, $out, $url);

However, I do not want URLs inside a src="url" attribute to be converted into links.

How can I exclude URLs enclosed in an attribute from this pattern?

UPDATE: input would be:

Bellow you can see http://www.yahoo.com bla bla
<iframe src="http://yahoo.com"></frame

It needs to parse the first link but not the URL inside the src="".

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Ignacio Vazquez-Abrams, Apr 26 '11 at 01:31
And that's exactly why you don't use regular expressions to handle irregular languages like HTML. — deceze, Apr 26 '11 at 01:34
@deceze What _do_ you use? I know there are alternatives in this case and a lot of others, but it's a bit of a sweeping generalisation to say that regex shouldn't be used on HTML. — Dan Blows, Apr 26 '11 at 01:39
@Blowski An (X)HTML/DOM parser. HTML is a language that needs to be *parsed*. Regular expressions are only an option if the input is limited to a very regular subset of HTML. They very easily break down in situations like these. — deceze, Apr 26 '11 at 01:43
@Guillermo It would help if you could clarify what your input is. Sounds like it may or may not contain HTML? — deceze, Apr 26 '11 at 01:44
See also: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not/590789#590789 — johnsyweb, Apr 26 '11 at 01:48
@deceze What do you use on invalid HTML? Genuine question, as I use regex to grab sections of code (like everything between `` and `` for example) regardless of its validity. I tried using PHP's DOMDocument but had issues. I'm doing it server-side so can't use jQuery. If there's a better way, I genuinely would like to know. — Dan Blows, Apr 26 '11 at 01:48
@Blowski Operations on invalid HTML are by definition undefined. There can only be a best effort to extract any information from it and regular expressions are just as likely to break down as a parser. You'd usually try to fix the HTML first using a very lenient parser like Tidy, then proceed to parse it with a DOM parser. Doing it in a browser is basically the same thing; the browser has already mercifully done its best to make something out of the invalid HTML so you can traverse a proper DOM tree using Javascript. — deceze, Apr 26 '11 at 01:52
@deceze In the project I have in mind, I needed the raw HTML. It couldn't be changed in any way. It ran more than 20,000 times with no problems - I guess because missing `` tags is a much rarer problem than invalid HTML within the `` tags. — Dan Blows, Apr 26 '11 at 01:57
@Guillermo You keep asking for just excluding URLs surrounded by `"`, but I'm sure you'd like to parse `Hi, this is a URL: "http://example.com" And this is not: `. You *do* need an HTML parser. — deceze, Apr 26 '11 at 01:58
@Blowski There's also only one `` tag pair (usually). As long as that's there, it's not hard to grab anything in between. The OP is looking at a much more complex problem though. — deceze, Apr 26 '11 at 02:00
@deceze Agreed, my comments are totally off-topic of the OP. Just a comment really that there are use cases for regex with HTML. — Dan Blows, Apr 26 '11 at 02:02
@deceze Do you know any good HTML parser for PHP? (Thanks everyone for the help!!) — Guillermo, Apr 26 '11 at 02:06

anubhava · Answer 1 · 2011-04-26T04:01:16.630

Use this PHP code to extract links, except those inside src="":

<?php
   $p = '/((<)(?(2).*?src=[^>]*>).*?)*?((?:(?:(?:http|https|ftp):\/\/)|(?:www\.))(?:[\w\.]+)(?:[,:%#&\/?=\w+\.-]+))/smi';

   // multi-line input text
   $str = 'Visit http://www.google.com bla bla <iframe src="http://apple.com">
           </frame> Bellow you can see http://www.ibm.com bla bla';

   preg_match_all($p, $str, $m);
   var_dump( $m[3] );
?>

OUTPUT:

array(2) {
  [0]=>
  string(21) "http://www.google.com"
  [1]=>
  string(18) "http://www.ibm.com"
}

SUGGESTION:

Rather than making an exception only for src="", I think it would be better to exclude every link enclosed between < and >, using the following regex:

$p = '/((<)(?(2)[^>]*>)(?:.*?))*?((?:(?:http|https|ftp):\/\/|www\.).*?[,:%#&\/?=\w+\.-]+)/smi';
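Since the question asks for replacement rather than extraction, here is a minimal sketch of the same "skip everything between < and >" idea applied to linkifying. It is not the regex above: it assumes PHP 7 or later, splits the input into text and tag chunks with preg_split, and only rewrites the text chunks. The function name linkify_outside_tags is hypothetical, and the URL pattern is a simplified adaptation of the one in the question.

```php
<?php
// Hypothetical helper, not part of the original answer:
// turn URLs into links, but leave anything inside <...> untouched.
function linkify_outside_tags(string $html): string
{
    // Split into alternating chunks: text, tag, text, tag, ...
    // The capture group plus PREG_SPLIT_DELIM_CAPTURE keeps the tags in the result.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Simplified URL pattern adapted from the question (an assumption, not a full URL grammar).
    $url = '~\b((?:(?:https?|ftp)://|www\.)[\w.]+[,:%#&/?=\w+.-]*)~i';

    foreach ($parts as $i => $part) {
        // With one capture group and no PREG_SPLIT_NO_EMPTY, odd indexes hold the tags.
        if ($i % 2 === 1) {
            continue;
        }
        $parts[$i] = preg_replace($url, '<a href="$1" target="_blank">$1</a>', $part);
    }

    return implode('', $parts);
}

$str = 'Below you can see http://www.yahoo.com bla bla <iframe src="http://yahoo.com"></iframe>';
echo linkify_outside_tags($str), "\n";
```

This is still a regex-level approximation, with the limits raised in the comments on the question: a ">" inside an attribute value breaks the tag split, and a URL that is already the text of an existing a element would be wrapped a second time. For arbitrary HTML, a DOM parser remains the safer route.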
