Hello, I need to wrap every line that is not already inside an HTML tag in this format:

<p>lorem ipsum</p>

e.g.

hello world

<h2>lol</h2>

lorem ipsum
dolor sit
amet

consetetur

should be parsed to

<p>hello world</p>

<h2>lol</h2>

<p>lorem ipsum
dolor sit
amet</p>

<p>consetetur</p>

I tried this with the PHP function preg_replace().

Can someone help?

P.S. I am trying to convert this syntax into HTML:

# header 1 // <h1>header 1</h1>
## header 2 // <h2>header 2</h2>

All lines without a header marker should be parsed into paragraphs.

My headers are parsed, but the paragraphs are not.
craver
  • Remember the obligatory http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 :) – Shadikka Jun 07 '11 at 12:08
  • `/^[^(<)(.+)(>)](.*)(^[\r?\n?]$)/m` replaced with `\1`: this is the regex I tried in different variants. – craver Jun 07 '11 at 12:09
  • The input already contains HTML tags, so your question is a bit imprecise on what you would like to achieve. – hakre Jun 07 '11 at 12:11

3 Answers

This is a bit verbose, but it should be solid. It uses DOMDocument rather than regex:

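$dom = new DOMDocument;
// Wrap the fragment in a single root element so it parses as XML.
// Note: the content itself must be well-formed XML for loadXML() to succeed.
$dom->loadXML('<root>' . $yourContent . '</root>');
$xpath = new DOMXPath($dom);

// Select only the text that sits directly under the root, i.e. unwrapped text.
$nodes = $xpath->query('/root/text()');

// Replace a text node with a <p> element that contains it.
function wrapnode($node) {
    $p = $node->ownerDocument->createElement('p');
    $node->parentNode->replaceChild($p, $node);
    $p->appendChild($node);
}

foreach ($nodes as $node) {
    if ($node->nodeType === XML_TEXT_NODE) {
        $node->nodeValue = trim($node->nodeValue);

        // Split at every blank line and wrap each piece separately.
        while (($location = strpos($node->nodeValue, "\n\n")) !== false) {
            $newnode = $node->splitText($location);
            wrapnode($node);

            $node = $newnode;
            $node->nodeValue = trim($node->nodeValue);
        }

        // Skip whitespace-only text so no empty <p> elements are created.
        if ($node->nodeValue !== '') {
            wrapnode($node);
        }
    }
}

echo $dom->saveXML();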
lonesomeday
  • `$node->nodeValue = trim($node->nodeValue);` will take care that `"\n\n"` is never found at the end of it so renders most of the code useless - if I have spotted that right. – hakre Jun 07 '11 at 21:44
  • @hakre Yes, the `\n\n` are removed by this code. They are ignored in rendered HTML anyway. This code breaks on `\n\n` in the *middle* of strings. – lonesomeday Jun 07 '11 at 21:47
  • but doesn't that mean that simple lines of text will be converted into paragraphs as well? – hakre Jun 07 '11 at 21:50
  • What do you mean by "simple lines of text"? – lonesomeday Jun 07 '11 at 21:58
  • Those w/o "\n" at the beginning or end. (for the example data given by the OP it works.) – hakre Jun 07 '11 at 22:01
  • @hakre Could you provide some data where it wouldn't work? I'm afraid I don't understand your point. – lonesomeday Jun 07 '11 at 22:06
  • sure, take this and check the output: `$yourContent = "hello world\n\n lol test";`. It will wrap test into p although test has no line breaks surrounding it. Typically I would say double line break = p. – hakre Jun 07 '11 at 22:09
  • @hakre *shrug* That seems like a good thing to me. All unwrapped text is wrapped in `p` elements. I may have misinterpreted the question, but it is fairly vague. – lonesomeday Jun 07 '11 at 22:11
  • I think the question is not precise, so nobody could really say. I'll run a test converting the HTML 2 into domdocument (my suggestion below) to see how it performs. – hakre Jun 07 '11 at 22:26
  • Works, added it to my answer. – hakre Jun 07 '11 at 22:42

In valid HTML 2.0, <p> does not need a closing tag. So turning the input into HTML with a new paragraph at every double line break is very simple:

$html = str_replace("\n\n", '<p>', $html);

Keep in mind that this solution is very specific to the input and the output, so it might solve part of the scenario in your question only. However I could not get enough information from your question to give a better answer.
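To make that caveat concrete, here is a small sketch that runs the replacement on the sample input from the question. The variable `$result` is only an illustrative name. Note that the first block gets no leading <p>, and that a <p> is also placed in front of the existing <h2>.

```php
<?php
// Sample input copied from the question.
$html = "hello world\n\n<h2>lol</h2>\n\nlorem ipsum\ndolor sit\namet\n\nconsetetur";

// Every blank line becomes an unclosed <p> tag.
$result = str_replace("\n\n", '<p>', $html);

echo $result, "\n";
// hello world<p><h2>lol</h2><p>lorem ipsum
// dolor sit
// amet<p>consetetur
```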

HTML 4.01 can then be created from that HTML 2 output with ease:

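// Insert an unclosed <p> at every blank line (valid in HTML 2.0).
$html = str_replace("\n\n", "<p>", $yourContent);
$dom = new DOMDocument;
// loadHTML() parses the loose markup into a normalized document tree.
$dom->loadHTML($html);
echo $dom->saveHTML();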

DOMDocument can convert the HTML 2 into HTML 4.01 and will add the needed elements such as the doctype, html and body. Only the head and title are missing.

hakre

This works in Java:

input.replaceAll("(?<=\\n\\n)(?=\\w)", "<p>").replaceAll("(?<=\\w)(?=\\n\\n)", "</p>");

However it's a bit brittle: It does two replacements that might not be connected.
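One way to avoid two unconnected replacements is to handle each block in a single pass. The following is a minimal sketch in PHP (the question's language), not a port of the Java code. It assumes that blocks are separated by blank lines and that any block starting with "<" is already markup. The function name `wrapParagraphs` is hypothetical.

```php
<?php
// Hypothetical helper: wraps every bare text block in <p>...</p>.
function wrapParagraphs(string $input): string
{
    // Normalize line endings so blank lines are always "\n\n".
    $input = str_replace(["\r\n", "\r"], "\n", $input);

    // Split into blocks at blank lines (lines that are empty or whitespace-only).
    $blocks = preg_split('/\n\s*\n/', trim($input));

    $out = [];
    foreach ($blocks as $block) {
        $block = trim($block);
        if ($block === '') {
            continue; // nothing to wrap
        }
        // Assumption: a block that starts with "<" is already HTML and is kept as-is.
        $out[] = ($block[0] === '<') ? $block : '<p>' . $block . '</p>';
    }

    return implode("\n\n", $out);
}

// Sample input copied from the question.
$input = "hello world\n\n<h2>lol</h2>\n\nlorem ipsum\ndolor sit\namet\n\nconsetetur";
$result = wrapParagraphs($input);
echo $result, "\n";
```

Because the opening and closing tags are added to the same block together, they cannot get out of step. The check for a leading "<" is deliberately naive: a paragraph that begins with inline markup such as <em> would be left unwrapped.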

Bohemian