Remove duplicate link with regex

Question

I'm trying to parse some HTML and remove an unnecessary duplicate link. For example, I would like the following code:

<p>
  Lorem ipsum amet 
  <a href="http://edition.cnn.com/">
    Proin lacinia posuere
  </a>
   sit ipsum.
</p>
<p>
  <a href="http://www.google.com/articles/blah">
    [caption align="alignright"]
    <a href="http://www.google.com/articles/blah">
      <img src="http://hoohlr.dev/Picture-142-300x222.png" alt="Blah blah/Flickr " height="222" class="size-medium wp-image-4351" />
    </a>
     sociis magnis [/caption]
  </a>
</p>

To be converted into this (removing the link before the [caption] as well as its closing tag):

<p>
  Lorem ipsum amet 
  <a href="http://edition.cnn.com/">
    Proin lacinia posuere
  </a>
   sit ipsum.
</p>
<p>
  [caption align="alignright"]
  <a href="http://www.google.com/articles/blah">
    <img src="http://hoohlr.dev/Picture-142-300x222.png" alt="Blah blah/Flickr " height="222" class="size-medium wp-image-4351" />
  </a>
   sociis magnis [/caption]
</p>

The link to be removed will always be the one just before the [caption]. Can anyone good with regex help me do this using PHP's preg_replace (or a simpler method)?

I would be much appreciative. Thanks!

Edit: OK, I've made a pretty good attempt at what I'm looking for. http://regexr.com?31t05 and http://regexr.com?31svv Tried to post it as an answer but the site wouldn't let me... Can anyone improve upon it?

Basically I'm building a migration script and I used DOMDocument() to rebuild the img tags using get_image_send_to_editor() in WordPress. If the img tag had an anchor as a parent, it was replaced with [caption][/caption]. The script fails to remove the anchor outside the [caption], which is what I'm trying to do now. So yes, it's invalid because I made it invalid. Trying to fix that. Thanks! — wired, Aug 21 '12 at 03:50
@Mechanicalsnail - I don't think that applies here. The OP *cannot* use a DOM parser, since he's dealing with (known-to-be) invalid HTML. — Joseph Silber, Aug 21 '12 at 03:53
OK, I've made a pretty good attempt at what I'm looking for. http://regexr.com?31t05 and http://regexr.com?31svv Tried to post it as an answer but the site wouldn't let me... Can anyone improve upon it? — wired, Aug 21 '12 at 04:30

score 0 · Accepted Answer · answered Aug 21 '12 at 04:46

This tested script works for your test data:

<?php // test.php Rev:20120820_2200
function stripNestedAnchorTags($text) {
    $re = '%
        # Match (invalid) outer A element containing inner A element.
        <a\b[^<>]+>\s*               # Outer A element start tag (and ws).
        (                            # $1: contents of outer A element.
          [^<]*(?:<(?!/?a\b)[^<]*)*  # Everything up to inner <a>
          <a\b[^<>]+>                # Inner A element start tag.
          [^<]*(?:<(?!/?a\b)[^<]*)*  # Everything up to inner </a>
          </a>                       # Inner A element end tag.
          [^<]*(?:<(?!/?a\b)[^<]*)*  # Everything up to outer </a>
        )                            # End $1: contents of outer A.
        </a>\s*                      # Outer A element end tag (and ws).
        %ix';
    // Repeat until no outer anchor wrapping an inner anchor remains.
    while (preg_match($re, $text)) {
        $text = preg_replace($re, '$1', $text);
    }
    return $text;
}
$idata = file_get_contents('testdata.html');
$odata = stripNestedAnchorTags($idata);
file_put_contents('testdata_out.html', $odata);
?>
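
Since the question says the unwanted link always sits directly before the [caption], a narrower pattern may be enough when you want to leave every other nested anchor alone. The following is only a sketch under that assumption: stripAnchorAroundCaption is a hypothetical name, and it expects the outer anchor to wrap exactly one [caption]...[/caption] shortcode with nothing but whitespace in between.

```php
<?php
// Hypothetical helper (not from the answer above): removes only an anchor
// that directly wraps a single [caption]...[/caption] shortcode.
function stripAnchorAroundCaption(string $html): string {
    $re = '%
        <a\b[^<>]*>\s*               # Outer A start tag (and ws) right before the shortcode.
        (                            # $1: the whole shortcode, kept as-is.
          \[caption\b[^\]]*\]        # Opening [caption ...] tag.
          (?:(?!\[/caption\]).)*     # Contents, never running past [/caption].
          \[/caption\]               # Closing [/caption] tag.
        )                            # End $1.
        \s*</a>                      # Outer A end tag right after the shortcode.
        %isx';
    $out = preg_replace($re, '$1', $html);
    return $out ?? $html;            // preg_replace returns null on error.
}

// Usage on a shortened version of the markup from the question.
$sample = <<<'HTML'
<p>
  <a href="http://www.google.com/articles/blah">
    [caption align="alignright"]
    <a href="http://www.google.com/articles/blah"><img src="x.png" /></a>
     sociis magnis [/caption]
  </a>
</p>
HTML;
echo stripAnchorAroundCaption($sample), "\n";
```

Unlike the general function above, this leaves any other anchor-inside-anchor markup untouched, so it is only a fit if the [caption] case is the sole source of nesting. On a regex error it returns the input unchanged rather than null.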

Thank you kind Sir! I don't quite understand it all, but I've tested and it seems to work flawlessly. Huge thanks!! I'd give you an upvote if I could. — wired, Aug 23 '12 at 19:26
