1

I'm inserting HTML paragraphs (<p></p>) into a piece of text, like this:

$text = '<p>' . preg_replace("/(\n|\r|\r\n)+/i", "</p><p>", $text) . '</p>' ;

Which seems to work well, except I don't want any paragraphs within <code></code> blocks, since content within those blocks is pre-formatted (using a white-space:pre; style).

I'm not sure how best to handle this. I've tried to remove any such tags after the above line of code, but that's causing me some trouble and I figure it would be much better not to insert them in the first place.

Is it possible and/or practical to make the exclusion in the regex above? If not, what else?

Thanks

Edit: Came up with this code based on Nameless' answer below. It appears to work.

$chunks = preg_split("/(<code>.*?<\/code>)/is", $text, -1, PREG_SPLIT_DELIM_CAPTURE) ;
$text = '' ;
foreach($chunks as $chunk) {
    if (preg_match("/^<code>/i", $chunk)) {
        $text .= $chunk ;
    } else {
        $text .= '<p>' . preg_replace("/(\n|\r)+/i", "</p><p>", $chunk) . '</p>' ;
    }
}
  • Sorry. This text "Line one\n\n\rLine two\r\nLine three\nLine four" would become "<p>Line one</p><p>Line two</p><p>Line three</p><p>Line four</p>". And I know CSS is for styling, but the HTML still tells CSS where to apply those styles. –  Aug 20 '11 at 15:04
  • You want to use an HTML toolkit for this. See http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662. – Gordon Aug 21 '11 at 13:26

2 Answers

1

Well, it is possible with the PCRE regex engine, but it is highly impractical and resource-heavy.

$text = '<p>' . preg_replace("/(\n|\r|\r\n)+(?!(.(?!<code>))*<\/code>)|(\n|\r|\r\n)+(?=<code>)/is", "</p><p>", $text) . '</p>' ;

Using DOM is probably the best solution, if you can spare some additional RAM for this operation. If not, you could split your string beforehand into chunks of <code> ... </code> and everything else, then use your regex on the chunks that are not in <code>, then glue everything back into a single string.
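The split-and-glue idea could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the function name `addParagraphsBySplitting` is made up for the example, and it assumes that <code> blocks are not nested and are always closed. Unlike a plain `preg_replace`, it drops empty pieces, so a newline right before or after a <code> block does not produce an empty <p></p>.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original post).
// Wraps each line outside <code>...</code> in <p>...</p> and leaves the
// <code> blocks exactly as they are.
function addParagraphsBySplitting(string $text): string
{
    // Capturing the delimiter makes preg_split return the <code> blocks
    // as their own chunks; PREG_SPLIT_NO_EMPTY drops empty chunks.
    $chunks = preg_split(
        '/(<code\b[^>]*>.*?<\/code>)/is',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $out = '';
    foreach ($chunks as $chunk) {
        if (preg_match('/^<code\b/i', $chunk)) {
            // Pre-formatted block: copy through untouched.
            $out .= $chunk;
            continue;
        }
        // Any run of \r and \n counts as one paragraph break.
        $lines = preg_split('/[\r\n]+/', $chunk, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($lines as $line) {
            $out .= '<p>' . $line . '</p>';
        }
    }
    return $out;
}
```

Calling `addParagraphsBySplitting($text)` then replaces the single `preg_replace` line. The non-greedy `.*?` with the `s` flag is what lets a <code> block span several lines.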

Nameless
  • Thanks. The idea of splitting it up seems to be a good one. I edited in my solution based on that suggestion. It seems to work. I don't know if it's the most efficient way; maybe I'll look into DOM sometime. –  Aug 20 '11 at 15:55
-1

Never ever ever ever ever ever try to parse HTML with regex.

Use for example PHP's DOM: http://php.net/manual/en/book.dom.php

:)
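For this particular task, a DOM-based version might look something like the sketch below. It is an illustration under several assumptions, not a reference implementation: the function name `addParagraphsWithDom` is invented for the example, the input is assumed to be UTF-8 plain text with <code> blocks at the top level only, and any other markup at the top level is passed through as-is between paragraphs. Note that text is decoded and re-escaped on the way through, so a bare `&` comes out as `&amp;`.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original post).
// Parses the text as an HTML fragment, wraps each line of top-level text
// in <p>, and serializes <code> elements unchanged.
function addParagraphsWithDom(string $text): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration hints UTF-8 to the parser; the wrapping <div>
    // gives the fragment a single root that is easy to find afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $text . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $root = $doc->getElementsByTagName('div')->item(0);

    $out = '';
    foreach ($root->childNodes as $node) {
        if ($node instanceof DOMText) {
            // Top-level text: one <p> per non-empty line.
            $lines = preg_split('/[\r\n]+/', $node->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
            foreach ($lines as $line) {
                $p = $doc->createElement('p');
                $p->appendChild($doc->createTextNode($line));
                $out .= $doc->saveHTML($p);
            }
        } else {
            // Elements such as <code> are written back without changes.
            $out .= $doc->saveHTML($node);
        }
    }
    return $out;
}
```

The newlines inside <code> never reach the line-splitting step, because they live in a text node below the <code> element rather than directly under the wrapper.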

PeeHaa
  • Is that part of the standard installation of PHP, or something I need to install separately? If the latter, I may not have access to it (could check, of course). –  Aug 20 '11 at 15:06
  • @MCXXII: The `libxml` extension needs to be installed, although I think it is installed by default. You can check your installed extensions by calling `phpinfo();` – PeeHaa Aug 20 '11 at 15:09
  • It's installed. I may look into it at some point. Thanks for the link. –  Aug 20 '11 at 15:57
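
As a lighter alternative to reading through the `phpinfo();` output mentioned in the comments above, the check can be done in code with `extension_loaded()`. A small sketch, where the helper name `missingExtensions` is made up for the example:

```php
<?php
// Hypothetical helper (name is an assumption, not from the original post).
// Returns the subset of the given extension names that are NOT loaded.
function missingExtensions(array $names): array
{
    $missing = [];
    foreach ($names as $name) {
        if (!extension_loaded($name)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// DOMDocument lives in the "dom" extension, which builds on "libxml".
$missing = missingExtensions(['libxml', 'dom']);
echo $missing === []
    ? 'DOM support is available' . PHP_EOL
    : 'Missing extensions: ' . implode(', ', $missing) . PHP_EOL;
```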