I am writing a regex find/replace that inserts a <span> into every <a href> in a file that does not already contain one. Other tags, such as <img> or <b>, may also appear inside the <a href>.

Currently I have this regex:
Find: (<a[^>]+?style=".*?color:#(\w{6}).*?".*?>)(.+?)(<\/a>)
Replace: '$1<span style="color:#$2;">$3</span>$4'

It works great, except that if I run it over the same file a second time, it inserts a <span> inside the existing <span> and things get messy.

Target Example:

We want it to ignore this:
<a href="http://mywebiste.com/link1.html" target="_blank" style="color:#bfbcba; text-decoration:underline;"><span style="color:#bfbcba;">Howdy</span></a>

But not this:
<a href="http://mywebiste.com/link1.html" target="_blank" style="color:#bfbcba; text-decoration:underline;">Howdy</a>

Or this:
<a href="http://mywebiste.com/link1.html" target="_blank" style="color:#bfbcba; text-decoration:underline;"><img src="myimg.gif" />Howdy</a>

--EDIT--

Using the PHP DOM library as suggested in the comments, this is what I have so far:

$doc = new DOMDocument();
$doc->loadHTML($input);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
    $spancount = $tag->getElementsByTagName("span")->length;
    if($spancount == 0){
        $element = $doc->createElement('span');
        $tag->appendChild($element);
    }
}

echo $doc->saveHTML();

Currently it detects whether there is a span inside an anchor and, if there isn't, appends an empty span inside the anchor. However, I have yet to figure out how to move the original contents of the anchor into the span.

Caleb Larsen

1 Answer

Don't use regex for this; regular expressions are a poor fit for nested HTML.

Use a DOM library: call getElementsByTagName('a'), then iterate through each anchor and check whether it contains a nested span element by calling getElementsByTagName('span') on it and reading the length property. If it doesn't, create a new span with document.createElement('span'), move the anchor's firstChild (and any remaining children) into it, and appendChild the span to the anchor.
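
The steps above can be sketched in PHP roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `wrapAnchorContents` and its `$style` parameter are hypothetical, and the input is assumed to be a non-empty HTML string that `DOMDocument::loadHTML` can parse. Instead of copying markup as a string, it moves the anchor's existing child nodes into the new span, so nested tags such as `<img>` or `<b>` survive as nodes.

```php
<?php
// Hypothetical helper: wrap the contents of every <a> that has no <span>
// descendant in a new <span>. Anchors that already contain a span are skipped,
// so running it twice does not nest spans.
function wrapAnchorContents(string $html, string $style = ''): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list; copy it before modifying the tree.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));

    foreach ($anchors as $anchor) {
        if ($anchor->getElementsByTagName('span')->length > 0) {
            continue; // already has a span somewhere inside
        }

        $span = $doc->createElement('span');
        if ($style !== '') {
            $span->setAttribute('style', $style);
        }

        // appendChild() moves a node, so this drains the anchor into the span.
        while ($anchor->firstChild !== null) {
            $span->appendChild($anchor->firstChild);
        }
        $anchor->appendChild($span);
    }

    // Note: saveHTML() serializes the whole document, including the
    // doctype/html/body wrapper that loadHTML() adds around a fragment.
    return $doc->saveHTML();
}
```

For example, `wrapAnchorContents($input, 'color:#bfbcba;')` could stand in for the loop in the question. Deriving the colour from each anchor's own `style` attribute is left out of this sketch.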

EDIT: As for grabbing the inner html of the anchor, if there are lots of nodes inside, try using this:

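<?php
// Serialize the children of $node (its "inner HTML") by copying them
// into a scratch document and saving that document as a string.
function innerHTML($node){
  $doc = new DOMDocument();
  foreach ($node->childNodes as $child) {
    // importNode(..., true) makes a deep copy owned by the scratch document
    $doc->appendChild($doc->importNode($child, true));
  }

  return $doc->saveHTML();
}

// $anchorRef is the DOMElement of the anchor, e.g. $tag inside your foreach loop
$html = innerHTML( $anchorRef );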

This may also help you out: Change innerHTML of a php DOMElement
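
As a possible alternative to building a second document, `DOMDocument::saveHTML()` also accepts a node argument (PHP 5.3.6 and later), so each child can be serialized in place. The sketch below is only an illustration of that idea: the function name `innerHtmlOf` and the sample markup are made up for the example.

```php
<?php
// Hypothetical helper: concatenate the serialized HTML of each child node.
function innerHtmlOf(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes just that node, not the whole document.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Sample input, invented for this example.
$doc = new DOMDocument();
$doc->loadHTML('<a href="x.html"><u>Underlined Link</u> and text</a>');
$anchor = $doc->getElementsByTagName('a')->item(0);

echo innerHtmlOf($anchor), "\n";
```

Either way the result is a plain string, so it has to be parsed back into nodes before it can be inserted into another element as markup.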

meder omuraliev
  • Full ack, regex and html = bad. Though I would probably use an html parser or even simplexml instead of javascript for the sake of ppl who use lynx. – Robin Aug 18 '10 at 15:59
  • Thanks for the DOM suggestions. I have started using the PHP DOM (for the first time!) and I am having a heck of a time sorting out how to take the contents of an element (for `<a>my link</a>`, that is `my link`) and wrap them in a span. I've had no problem creating the new span element and appending it, but getting the original contents inside the `<span>` has been stumping me. – Caleb Larsen Aug 18 '10 at 20:18
  • Well, it would be much easier for me (and others) to help if you posted your attempt in your original question. – meder omuraliev Aug 18 '10 at 20:22
  • One problem I'm having with the above `innerHTML()` function is that it returns a string, and when I set the `nodeValue` of the anchor to the returned string, the HTML is escaped like: `Underlined Link<u>Underlined Link</u> ` The body of my `foreach` loop now looks like this: ` $element = $doc->createElement('span'); $content = innerHTML($tag); $element->setAttribute('style','color:#ffffff;'); $element->nodeValue = $content; $tag->nodeValue = ""; //clear node $tag->appendChild($element);` – Caleb Larsen Aug 19 '10 at 15:48
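
The escaping described in the last comment happens because `nodeValue` is treated as text, so any `<` in the string is escaped on output. One hedged way around it, if the string has to be kept, is to parse it into nodes with `DOMDocumentFragment::appendXML()` before appending. This is a sketch under assumptions: the sample markup is invented, `$content` stands in for the string returned by `innerHTML($tag)`, and `appendXML()` only accepts well-formed XML, so it will reject markup such as an unclosed `<img>`.

```php
<?php
// Sample anchor, invented for this example.
$doc = new DOMDocument();
$doc->loadHTML('<a href="x.html"><u>Underlined Link</u></a>');
$tag = $doc->getElementsByTagName('a')->item(0);

// Stand-in for the string returned by innerHTML($tag).
$content = '<u>Underlined Link</u>';

$element = $doc->createElement('span');
$element->setAttribute('style', 'color:#ffffff;');

// Parse the string into real nodes instead of assigning it as text.
$fragment = $doc->createDocumentFragment();
if ($fragment->appendXML($content)) {
    $element->appendChild($fragment);

    // Clear the anchor's old children, then attach the filled span.
    while ($tag->firstChild !== null) {
        $tag->removeChild($tag->firstChild);
    }
    $tag->appendChild($element);
}

$result = $doc->saveHTML($tag);
echo $result, "\n";
```

If the well-formedness restriction is a problem, moving the existing child nodes into the span with `appendChild()` avoids the string round trip altogether.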