I want to convert plain-text URLs such as http://google.com/ into HTML links. However, if a URL is already part of an HTML link, either in the href="" attribute or in the link text, I don't want to convert it.

I found this in another question:

$text = preg_replace('@(https?:\/\/([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?)@', '<a href="$1" target="_blank">$1</a>', $text);

However if I have something such as:

<a href="http://google.com/">http://google.com/</a>

already in the target text, the replacement creates two more links inside that HTML. I can't figure out a pattern that detects whether the URL is followed by </a> or sits inside the quotes of an attribute.

The Real Roxette
  • [DON'T DO IT MAN!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – cwallenpoole Aug 20 '11 at 02:49
  • bbpress does it, but looking through their spaghetti code, I can't figure out how. – The Real Roxette Aug 20 '11 at 02:58
  • Context-awareness isn't simple, but you can likely get away with a minimal lookaround. Precede your regex with `(?<!href="|src="|">)`, a negative lookbehind assertion, to exclude the main culprits. (Another common approach is to *normalize* the input text by removing already HTMLified URLs.) – mario Aug 20 '11 at 03:16

2 Answers

Do not use regular expressions for (X)HTML parsing. Use DOM instead! The XPath `//text()[not(ancestor::a) and contains(., 'http://')][1]` should find the first text node that contains at least one HTTP URL and is not itself inside an anchor element. You can then naively replace that text node with three nodes: a text node holding the text before the URL, an anchor element with an href attribute and the URL as its text, and a text node holding the remaining text. Repeat until no text node matches the XPath. The loop terminates because each converted URL ends up inside an anchor and no longer matches.
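For illustration, here is a rough PHP sketch of the DOM approach described above. It is a variation rather than a literal transcription: instead of re-running the XPath in a while loop, it takes a snapshot of the matching text nodes and splits each one around every URL in a single pass. The function name `linkifyHtml`, the wrapper id `linkify-root`, and the deliberately simple URL pattern are all assumptions made for this example. Encoding handling is left out, and a URL followed directly by punctuation will swallow that punctuation.

```php
<?php
// Sketch only: linkifyHtml() is a hypothetical helper, not a library function.
function linkifyHtml(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    // Wrap the fragment so its contents can be extracted again afterwards.
    $doc->loadHTML('<div id="linkify-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="linkify-root"]')->item(0);

    // Text nodes that are not inside an <a> and that mention a URL scheme.
    $query = './/text()[not(ancestor::a) and '
           . '(contains(., "http://") or contains(., "https://"))]';
    // Snapshot the node list first, because the tree is modified below.
    $textNodes = iterator_to_array($xpath->query($query, $root));

    // Simplified URL pattern (an assumption, not the regex from the question).
    $urlPattern = '@(https?://[^\s<>"]+)@';

    foreach ($textNodes as $textNode) {
        // With DELIM_CAPTURE, odd indexes hold URLs and even indexes hold plain text.
        $parts = preg_split($urlPattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkifyHtml('Visit http://google.com/ or <a href="http://example.com/">http://example.com/</a>.'), "\n";
```

In this example the bare URL should be wrapped in a new anchor while the existing link is left alone, because both its href attribute and its link text are excluded by the `not(ancestor::a)` test and by only looking at text nodes.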

dma_k
Allan
  • Maybe you can provide a sample XSLT to make a transformation? – dma_k Aug 20 '11 at 10:38
  • I never did any XSLT. I would implement it with a while loop because text nodes containing more than one URL need to be processed more than once. – Allan Aug 20 '11 at 16:07

Based on mario's comment to my original post:

$text = preg_replace('@(?<!href="|src="|">)(https?:\/\/([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?)@', '<a href="$1">$1</a>', $text);

Works perfectly for replacing bbpress's unknown pasta salad.
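A quick way to check the behaviour is a small test like the one below. The wrapper name `linkifyPlain` is made up for this sketch; the pattern is the one shown above. Keep in mind that the lookbehind only inspects the characters immediately before the URL, so link text such as `>Visit http://google.com/</a>` or attributes written with single quotes are still rewritten.

```php
<?php
// linkifyPlain() is a hypothetical wrapper around the preg_replace call above.
function linkifyPlain(string $text): string
{
    $pattern = '@(?<!href="|src="|">)(https?:\/\/([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?)@';
    // preg_replace returns null on error; fall back to the untouched input.
    return preg_replace($pattern, '<a href="$1">$1</a>', $text) ?? $text;
}

// A bare URL is wrapped in an anchor.
$plain = linkifyPlain('See http://google.com/ now');

// A URL right after href=" or "> is skipped, so this link stays as it is.
$existing = '<a href="http://google.com/">http://google.com/</a>';
$linked = linkifyPlain($existing);

// Caveat: the URL here follows "Visit ", not ">, so it is not protected.
$nested = '<a href="http://google.com/">Visit http://google.com/</a>';
$rewritten = linkifyPlain($nested);

echo $plain, "\n", $linked, "\n", $rewritten, "\n";
```

If the input can contain arbitrary HTML, the third case suggests that a DOM-based approach is the safer choice.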

The Real Roxette