Do not use regex. Why? See: RegEx match open tags except XHTML self-contained tags
Use proper DOM manipulators. See: http://php.net/manual/en/book.dom.php
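To give a rough idea of what working with the DOM looks like, here is a minimal sketch. The HTML string and the variable names are made up for illustration; only the DOMDocument calls come from the PHP DOM extension. It shows that differently spelled break tags all end up as the same kind of element node, which is exactly the kind of variation a regex has to guess at.

```php
<?php
// Hypothetical input; any HTML fragment would do.
$html = '<p>one<br />two<BR>three<br/></p>';

$dom = new DOMDocument;
// libxml can emit warnings on sloppy markup; collect them instead of printing
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// "<br />", "<BR>" and "<br/>" are all parsed into "br" element nodes
$brs = $dom->getElementsByTagName("br");
echo $brs->length, "\n";
foreach ($brs as $br) {
    echo $br->nodeName, " inside <", $br->parentNode->nodeName, ">\n";
}
```

Once the markup is a tree, "two br's in a row" becomes a question about sibling nodes rather than about string patterns.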
EDIT:
I'm not really a fan of giving cookbook recipes, so here's a solution for changing double <br />'s to text wrapped in <p></p>:
script.php:
<?php
// Tag names treated as block-level containers whose contents are left alone.
function isBlockElement($nodeName) {
    $blockElementsArray = array("pre", "div"); // edit to suit your needs
    return in_array($nodeName, $blockElementsArray, true);
}

// Returns true if any ancestor of $node is one of the block elements above.
// The type hint makes PHP reject anything that is not a DOMNode.
function hasBlockParent(DOMNode $node) {
    if (is_null($node->parentNode))
        return false;
    // compare the parent's tag name, not the node object itself
    if (isBlockElement($node->parentNode->nodeName))
        return true;
    return hasBlockParent($node->parentNode);
}

$myDom = new DOMDocument;
$myDom->loadHTMLFile("in-file");
$myDom->normalizeDocument();

// getElementsByTagName() returns a live list, so copy it before editing the tree
$elems = iterator_to_array($myDom->getElementsByTagName("*"));
foreach ($elems as $element) {
    $first = $element->nextSibling;
    $second = is_null($first) ? null : $first->nextSibling;
    // only act on an element that is immediately followed by two <br> nodes
    if (is_null($second) || $first->nodeName != "br" || $second->nodeName != "br")
        continue;
    if (hasBlockParent($element))
        continue;
    $parent = $element->parentNode;
    $parent->removeChild($second);
    $parent->removeChild($first);
    // put a <p> where the element was, then move the element inside it
    $newNode = $myDom->createElement("p");
    $parent->replaceChild($newNode, $element);
    $newNode->appendChild($element);
}
$myDom->saveHTMLFile("out-file");
?>
This is not a full solution, just a starting point. It's the best I could write during my lunch break, and please bear in mind that the last time I coded in PHP was about 2 years ago (I've been doing mostly C++ since then).
So anyways, the input file:
[dare2be@schroedinger dom-php]$ cat in-file
<strong>bold <br /><br /> text</strong><br /><br /><br />
<a href="#">link</a><br /><br />
<pre>some code</pre>
I'm a single br, <br /> leave me alone.
And the output file:
[dare2be@schroedinger dom-php]$ cat out-file
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p><strong>bold <br><br> text</strong></p><br><p><a href="#">link</a></p><pre>some code</pre>
I'm a single br, <br> leave me alone.</body></html>
The whole DOCTYPE mumbo jumbo is a side effect of saveHTMLFile(), which writes out a complete HTML document. The code doesn't do the rest of the things you said, like changing <bold><br><br></bold> to <bold><br></bold>. Also, this whole script is a quick draft, but you'll get the idea.
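For the two loose ends mentioned here, a rough sketch of how they might be approached with the same DOM API. Both helper functions, collapseBrRuns() and bodyInnerHtml(), are hypothetical names invented for this example and are not part of the script above; the sample markup is made up as well. The sketch assumes that "multiple br's" means br elements that are direct neighbours, so br's separated by whitespace text are left alone.

```php
<?php
// Hypothetical helper: collapses each run of directly adjacent <br>
// elements into a single <br>.
function collapseBrRuns(DOMDocument $dom) {
    // copy the live node list before removing anything from the tree
    $brs = iterator_to_array($dom->getElementsByTagName("br"));
    foreach ($brs as $br) {
        $prev = $br->previousSibling;
        // drop this <br> if the node right before it is also a <br>
        if (!is_null($prev) && $prev->nodeName == "br") {
            $br->parentNode->removeChild($br);
        }
    }
}

// Hypothetical helper: serializes only the children of <body>, which
// leaves out the DOCTYPE and the <html><body> wrapper.
function bodyInnerHtml(DOMDocument $dom) {
    $out = "";
    $body = $dom->getElementsByTagName("body")->item(0);
    if (is_null($body)) {
        return $out;
    }
    foreach ($body->childNodes as $child) {
        // saveHTML() accepts a node argument since PHP 5.3.6
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$dom = new DOMDocument;
$dom->loadHTML('<div><b>a<br><br><br>b</b><br>c</div>');
collapseBrRuns($dom);
echo bodyInnerHtml($dom), "\n";
```

The three br's inside the b element should collapse to one, while the single br after it is untouched. Whether whitespace between br's should also count as "adjacent" is a design choice this sketch does not make for you.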
)+/', '
', $str); and that'll convert multiple br to a single one – jon Sep 01 '11 at 22:57