I have the following code, which replaces URLs with the corresponding links:

$in = array
(
        '/(?:^|\b)((((http|https|ftp):\/\/)|(www\.))([\w\.]+)([,:%#&\/?=\w+\.-]+))(?:\b|$)/is'
);
$out = array
(
        "<a href=\"$1\" target=\"_blank\">$1</a>"
);
return preg_replace($in, $out, $url);

However, I do not want URLs inside a src="url" attribute to be converted into links.

How can I exclude URLs enclosed inside an attribute from this pattern?

UPDATE: input would be:

Bellow you can see http://www.yahoo.com bla bla
<iframe src="http://yahoo.com"></frame

It needs to convert the first link but not the URL inside the src=""

Guillermo
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Ignacio Vazquez-Abrams Apr 26 '11 at 01:31
  • And that's exactly why you don't use regular expressions to handle irregular languages like HTML. – deceze Apr 26 '11 at 01:34
  • But is it not possible to exclude links preceded by a >"< ? – Guillermo Apr 26 '11 at 01:37
  • @deceze What _do_ you use? I know there are alternatives in this case and a lot of others, but it's a bit of a sweeping generalisation to say that regex shouldn't be used on HTML. – Dan Blows Apr 26 '11 at 01:39
  • I just need to convert links that are not preceded by >" – Guillermo Apr 26 '11 at 01:42
  • @Blowski An (X)HTML/DOM parser. HTML is a language that needs to be *parsed*. Regular expressions are only an option if the input is limited to a very regular subset of HTML. They very easily break down in situations like these. – deceze Apr 26 '11 at 01:43
  • @Guillermo It would help if you could clarify what your input is. Sounds like it may or may not contain HTML? – deceze Apr 26 '11 at 01:44
  • See also: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not/590789#590789 – johnsyweb Apr 26 '11 at 01:48
  • @deceze What do you use on invalid HTML? Genuine question, as I use regex to grab sections of code (like everything between `` and `` for example) regardless of its validity. I tried using PHP's DOMDocument but had issues. I'm doing it server-side so can't use jQuery. If there's a better way, I genuinely would like to know. – Dan Blows Apr 26 '11 at 01:48
  • @Blowski Operations on invalid HTML are by definition undefined. There can only be a best effort to extract any information from it and regular expressions are just as likely to break down as a parser. You'd usually try to fix the HTML first using a very lenient parser like Tidy, then proceed to parse it with a DOM parser. Doing it in a browser is basically the same thing; the browser has already mercifully done its best to make something out of the invalid HTML so you can traverse a proper DOM tree using Javascript. – deceze Apr 26 '11 at 01:52
  • @deceze In the project I have in mind, I needed the raw HTML. It couldn't be changed in any way. It ran more than 20,000 times with no problems - I guess because missing `` tags is a much rarer problem than invalid HTML within the `` tags. – Dan Blows Apr 26 '11 at 01:57
  • @Guillermo You keep asking for just excluding URLs surrounded by `"`, but I'm sure you'd like to parse `Hi, this is a URL: "http://example.com" And this is not: `. You *do* need an HTML parser. – deceze Apr 26 '11 at 01:58
  • @Blowski There's also only one `` tag pair (usually). As long as that's there, it's not hard to grab anything in between. The OP is looking at a much more complex problem though. – deceze Apr 26 '11 at 02:00
  • @deceze Agreed, my comments are totally off-topic of the OP. Just a comment really that there are use cases for regex with HTML. – Dan Blows Apr 26 '11 at 02:02
  • @deceze Do you know any good HTML parser for PHP? (Thanks everyone for the help!!) – Guillermo Apr 26 '11 at 02:06
  • Start here: http://www.php.net/manual/en/refs.xml.php – deceze Apr 26 '11 at 02:08

1 Answer

Use this PHP code to extract links, except those inside src="":

<?php
   $p = '/((<)(?(2).*?src=[^>]*>).*?)*?((?:(?:(?:http|https|ftp):\/\/)|(?:www\.))(?:[\w\.]+)(?:[,:%#&\/?=\w+\.-]+))/smi';

   // multi-line input text
   $str = 'Visit http://www.google.com bla bla <iframe src="http://apple.com">
           </frame> Bellow you can see http://www.ibm.com bla bla';

   preg_match_all($p, $str, $m);
   var_dump( $m[3] );
?>

OUTPUT:

array(2) {
  [0]=>
  string(21) "http://www.google.com"
  [1]=>
  string(18) "http://www.ibm.com"
}


SUGGESTION:

Rather than making an exception only for src="" when extracting links, I think it would be better to exclude every link enclosed between < and > by using the following regex:

$p = '/((<)(?(2)[^>]*>)(?:.*?))*?((?:(?:http|https|ftp):\/\/|www\.).*?[,:%#&\/?=\w+\.-]+)/smi';
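Since the question asks for replacement rather than extraction, here is a minimal sketch of the same "skip everything between < and >" idea applied to preg_replace. It is not the pattern above: it assumes the input can be split into tags and text with preg_split, and linkify_outside_tags is a hypothetical helper name. The URL pattern is a simplified version of the one in the question.

```php
<?php
// Hypothetical helper (name is an assumption, not from the question):
// wraps URLs in <a> tags, but only in text that sits outside <...> tags.
function linkify_outside_tags(string $html): string
{
    // Simplified form of the URL pattern from the question.
    $url = '/\b((?:(?:https?|ftp):\/\/|www\.)[\w.]+[,:%#&\/?=\w+.-]+)/i';

    // Split into text and tags. The capturing group makes preg_split
    // return the tags too, so nothing is lost when the parts are rejoined.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        // Even indexes are plain text, odd indexes are the captured tags.
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace(
                $url,
                '<a href="$1" target="_blank">$1</a>',
                $part
            );
        }
    }

    return implode('', $parts);
}

// Sample input based on the question.
$input = 'Bellow you can see http://www.yahoo.com bla bla '
       . '<iframe src="http://yahoo.com"></iframe>';
echo linkify_outside_tags($input), "\n";
```

With this input, only the http://www.yahoo.com in the text should be wrapped in a link, and the src attribute is left alone. It is still a regex working on HTML, so the caveats from the comments apply: a > inside an attribute value will confuse the split, and a URL that is already the text of an existing <a> element will be linked a second time.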
anubhava