
Someone has asked a similar question, but the accepted answer doesn't meet my requirements.

Input:

<strong>bold <br /><br /> text</strong><br /><br /><br />
<a href="#">link</a><br /><br />
<pre>some code</pre>
I'm a single br, <br /> leave me alone.

Expected output:

<p><strong>bold <br /> text</strong><br /></p>
<p><a href="#">link</a><br /></p>
<pre>some code</pre>
<p>I'm a single br, <br /> leave me alone.</p>

The accepted answer I mentioned above converts multiple br to p, and then wraps the whole input in another p. But in my case that doesn't work, because you can't put a pre inside a p tag. Can anyone help?

update

The expected output before this edit was a little confusing. The whole point is:

  1. convert multiple br to a single one (achieved with preg_replace('/(<br />)+/', '<br />', $str);)

  2. check for inline elements and unwrapped text (there's no parent element in this case, input is from $_POST) and wrap with <p>, leave block level elements alone.
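A minimal sketch of step 1, assuming the markup may also contain `<br>` or `<br/>` variants and whitespace between consecutive breaks (the pattern above only matches the exact string `<br />` repeated back to back). The function name `collapseBreaks` is made up for illustration.

```php
<?php
// Hypothetical helper: collapse a run of two or more <br> tags
// (optionally separated by whitespace) into a single "<br />".
function collapseBreaks(string $html): string
{
    // <br\s*/?> matches <br>, <br/> and <br />; the "i" flag also accepts <BR>.
    // Whitespace is only consumed between breaks, never after the last one.
    $result = preg_replace('~<br\s*/?>(?:\s*<br\s*/?>)+~i', '<br />', $html);
    return $result ?? $html; // preg_replace() returns null on error
}

echo collapseBreaks('<strong>bold <br /><br /> text</strong><br /><br /><br />');
// prints: <strong>bold <br /> text</strong><br />
```

A single `<br />` is left untouched, so the last line of the input passes through unchanged.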

jon
  • Idle curiosity: why do you want to do this? – Mark Elliot Sep 01 '11 at 22:42
  • @jon: While I do not have the answer to your question, here's a good starting point: http://codepad.org/cIzTWlGF – Joseph Silber Sep 01 '11 at 22:47
  • when a form is submitted with a textarea, I got \n (newlines), and I can convert \n to br, but it doesn't look good, so I'm trying to convert multiple br to p. – jon Sep 01 '11 at 22:49
  • @joseph your code can be simply written as preg_replace('/(<br />)+/', '<br />', $str); and that'll convert multiple br to a single one – jon Sep 01 '11 at 22:57
  • @jon: You're so (embarrassingly) right. I'm pretty tired... – Joseph Silber Sep 01 '11 at 23:10
  • why not switch from a text area to a full editor like ckeditor (http://ckeditor.com/) or tinymce –  Sep 01 '11 at 23:18
  • @Dagon ckeditor and tinymce are written with client side Javascript, and they cannot be trusted. the input can be easily "hacked" with tools like firebug. we need to validate the input server side anyways. – jon Sep 03 '11 at 03:31
  • nothing in your question mentions validation, it was all about formatting which is what these tools are for. –  Sep 03 '11 at 09:02
  • @Dagon I think I did mention PHP in the subject, so JS is not an option. – jon Sep 03 '11 at 14:29

2 Answers


Do not use regex. Why? See: RegEx match open tags except XHTML self-contained tags

Use proper DOM manipulators. See: http://php.net/manual/en/book.dom.php
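To illustrate what the DOM approach gives you, here is a small sketch that lists the top-level nodes of a fragment as the parser sees them. The helper name `topLevelNodeNames` and the wrapper `<div>` are assumptions of this sketch, not part of the PHP DOM API.

```php
<?php
// Hypothetical helper: return the node names of a fragment's top-level nodes.
function topLevelNodeNames(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    // The wrapper <div> gives the fragment one known parent element.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = [];
    foreach ($dom->getElementsByTagName('div')->item(0)->childNodes as $node) {
        $names[] = $node->nodeName; // "#text" for text nodes, the tag name for elements
    }
    return $names;
}

print_r(topLevelNodeNames('<a href="#">link</a><br /><br /><pre>some code</pre> tail'));
// the names are: a, br, br, pre, #text
```

Once the markup is a tree, "is this node a `br`?" or "is this node inside a `pre`?" becomes a property lookup instead of a pattern match.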

EDIT: I'm not really a fan of giving cookbook recipes, but here's a solution for changing double <br />'s to text wrapped in <p></p>:

script.php:
<?php

// Returns true if the tag name is one we treat as block-level.
function isBlockElement($nodeName) {
  $blockElementsArray = array("pre", "div"); // edit to suit your needs
  return in_array(strtolower($nodeName), $blockElementsArray, true);
}

// Walks up the tree and reports whether any ancestor is block-level.
function hasBlockParent(DOMNode $node) {
  if (is_null($node->parentNode))
    return false;

  // compare the parent's tag name, not the node object itself
  if (isBlockElement($node->parentNode->nodeName))
    return true;

  return hasBlockParent($node->parentNode);
}

// True if $node exists and is a <br> element.
function isBr($node) {
  return $node instanceof DOMNode && $node->nodeName == "br";
}

$myDom = new DOMDocument;
$myDom->loadHTMLFile("in-file");
$myDom->normalizeDocument();

// getElementsByTagName() returns a live list, so take a snapshot before changing the tree
$elems = iterator_to_array($myDom->getElementsByTagName("*"));
foreach ($elems as $element) {
  $first = $element->nextSibling;
  $second = isBr($first) ? $first->nextSibling : null;
  if (isBr($first) && isBr($second) && !hasBlockParent($element)) {
    $parent = $element->parentNode;
    $parent->removeChild($second);
    $parent->removeChild($first);

    // remember the node that follows, if any (null means "append at the end")
    $nSibling = $element->nextSibling;

    // move the element into a new <p> at the same position
    $saved = $parent->removeChild($element);
    $newNode = $myDom->createElement("p");
    $newNode->appendChild($saved);
    $parent->insertBefore($newNode, $nSibling);
  }
}

$myDom->saveHTMLFile("out-file");

This is not a full solution, just a starting point. It's the best I could write during my lunch break, and please bear in mind that the last time I coded in PHP was about 2 years ago (I've been doing mostly C++ since then).

So anyways, the input file:

[dare2be@schroedinger dom-php]$ cat in-file
<strong>bold <br /><br /> text</strong><br /><br /><br />
<a href="#">link</a><br /><br />
<pre>some code</pre>
I'm a single br, <br /> leave me alone.

And the output file:

[dare2be@schroedinger dom-php]$ cat out-file 
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p><strong>bold <br><br> text</strong></p><br><p><a href="#">link</a></p><pre>some code</pre>
I'm a single br, <br> leave me alone.</body></html>

The whole DOCTYPE mumbo jumbo is a side effect of DOMDocument treating the input as a complete HTML document. The code doesn't do the rest of the things you asked for, like changing <strong><br><br></strong> to <strong><br></strong>. Also, this whole script is a quick draft, but you'll get the idea.
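For the second requirement from the question (wrap inline content in `<p>`, leave block elements alone), a rough sketch along the same DOM lines could look like this. It is not the script above: the name `wrapInlineRuns`, the wrapper `<div>` and the default `$blockTags` list are assumptions made for illustration, and it only looks at top-level nodes. Because it serializes node by node, no DOCTYPE or `<html>`/`<body>` wrapper ends up in the result.

```php
<?php
// Hypothetical helper: wrap top-level inline content in <p>, keep block
// elements as-is. Two or more consecutive top-level <br> end a paragraph.
// Assumptions: $blockTags is an illustrative, incomplete list; the input is
// ASCII or otherwise safe for DOMDocument::loadHTML()'s default encoding.
function wrapInlineRuns(string $html, array $blockTags = ['pre', 'div', 'p', 'ul', 'ol', 'table', 'blockquote']): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $html . '</div>');   // wrapper gives one known parent
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->getElementsByTagName('div')->item(0);
    $chunks = [];  // finished paragraphs and block elements
    $run = '';     // inline markup collected for the current paragraph
    $brCount = 0;  // consecutive top-level <br> elements seen so far

    $flush = function () use (&$chunks, &$run) {
        if (trim($run) !== '') {
            $chunks[] = '<p>' . trim($run) . '</p>';
        }
        $run = '';
    };

    foreach (iterator_to_array($root->childNodes) as $node) {
        if ($node->nodeName === 'br') {
            $brCount++;
            continue;
        }
        if ($brCount > 0 && $node instanceof DOMText && trim($node->nodeValue) === '') {
            continue; // whitespace between or right after breaks
        }
        if ($brCount >= 2) {
            $flush();        // a run of breaks ends the paragraph
        } elseif ($brCount === 1) {
            $run .= '<br>';  // a single break stays inside the paragraph
        }
        $brCount = 0;

        if ($node instanceof DOMElement && in_array($node->nodeName, $blockTags, true)) {
            $flush();
            $chunks[] = $dom->saveHTML($node); // block element, copied unchanged
        } else {
            $run .= $dom->saveHTML($node);
        }
    }
    $flush(); // trailing breaks are dropped

    return implode("\n", $chunks);
}

$input = '<strong>bold <br /><br /> text</strong><br /><br /><br />' . "\n"
       . '<a href="#">link</a><br /><br />' . "\n"
       . '<pre>some code</pre>' . "\n"
       . "I'm a single br, <br /> leave me alone.";
echo wrapInlineRuns($input);
```

Note that this drops the break runs that separate paragraphs instead of keeping one `<br />` as in the expected output, and breaks nested inside an element such as `<strong>` are not touched; those could still be collapsed afterwards with the `preg_replace` call from the question.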

  • will give +1 if you have the solution in DOM – ajreal Sep 02 '11 at 09:17
  • +1 for DOM. BTW PCRE has recursive regex support and could parse HTML (but this would actually be more difficult than using the DOM) – Arnaud Le Blanc Sep 02 '11 at 10:37
  • @ajreal There you go xD You owe me a "+1"... and a lunch break :) –  Sep 02 '11 at 10:38
  • given ... I TRUST U THE ABOVE IS WORKING! – ajreal Sep 02 '11 at 10:40
  • hummm...that seems much more complex than I thought. I can convert multiple br to a single one with regex: preg_replace('/(<br />)+/', '<br />', $str); but the hard part is to distinguish between inline and block elements. that is...if an element has no block-level parent, wrap it with <p>, leave it alone otherwise. – jon Sep 02 '11 at 13:57
  • @jon I've edited the solution, you just have to define the "block" elements. As for the "complex" part - it's just the way it is. Sometimes, you may get away with a simple hack which may work in your special case, but if you want your code to be maintainable and readable - you just have to stick with those less "hacky" methods. Just see how few lines of code it took me to apply changes. –  Sep 03 '11 at 01:33

Alright, I've got myself an answer, and I believe this is gonna work really well.

It's from WordPress...the wpautop function.

I've tested it with the input (from my question), and the output is -almost- the same as I expected, I just need to modify it a bit to fit my needs.

Thanks dare2be, but I'm not very familiar with DOM manipulation in PHP.

jon
  • All right, I mean whatever works for you. Just wanted to warn you - if you go down this path, sooner or later, Bad Things™ will happen. –  Sep 03 '11 at 13:57
  • @dare2be ahhh...so your warning applies to WordPress too? – jon Sep 03 '11 at 14:32
  • No, it's a general warning against regex. Just look at the definition of `wpautop`. It's spiked with `preg_replace()` calls. Plus, pulling code from projects has this disadvantage that if you don't follow their updates, you may miss a security update and your page will be vulnerable. –  Sep 08 '11 at 12:41