PHP regex: find text between a line start and an empty line, skipping lines that start with an HTML tag

Question

Hello, I need to wrap every block of lines that does not contain HTML tags in this format:

<p>lorem ipsum</p>

e.g.

hello world

<h2>lol</h2>

lorem ipsum
dolor sit
amet

consetetur

should be parsed to

<p>hello world</p>

<h2>lol</h2>

<p>lorem ipsum
dolor sit
amet</p>

<p>consetetur</p>

I tried this with the PHP function preg_replace().

Can someone help?

P.S. I am trying to convert this syntax into HTML:

# header 1 // <h1>header 1</h1>
## header 2 // <h2>header 2</h2>

and all lines without a header should be parsed into <p>...</p>

My headers are parsed, but the paragraphs are not.

Remember the obligatory http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 :) — Shadikka, Jun 07 '11 at 12:08
`/^[^(<)(.+)(>)](.*)(^[\r?\n?]$)/m` replaced with `\1`: this is the regex I tried in different variants. — craver, Jun 07 '11 at 12:09
The input already contains HTML tags, so your question is a bit imprecise on what you would like to achieve. — hakre, Jun 07 '11 at 12:11
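For reference, the transformation requested in the question can be sketched without any DOM parsing by splitting the text on blank lines and wrapping each block that does not already start with a tag. This is only a minimal sketch under stated assumptions: `wrapParagraphs` is a hypothetical helper name, not something from the thread, and "starts with `<`" is a crude stand-in for "is already HTML".

```php
<?php
// Hypothetical helper (not from the thread): wraps every blank-line-separated
// block that does not already start with "<" in <p>...</p>.
function wrapParagraphs(string $text): string
{
    // Normalise line endings, then split on runs of two or more newlines.
    $text = str_replace(["\r\n", "\r"], "\n", trim($text));
    $blocks = preg_split('/\n{2,}/', $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($blocks as $i => $block) {
        // Assumption: a block whose first non-space character is "<" is already markup.
        if (!preg_match('/^\s*</', $block)) {
            $blocks[$i] = '<p>' . $block . '</p>';
        }
    }

    return implode("\n\n", $blocks);
}

$input = "hello world\n\n<h2>lol</h2>\n\nlorem ipsum\ndolor sit\namet\n\nconsetetur";
echo wrapParagraphs($input), "\n";
```

On the sample input from the question this keeps the `<h2>` block untouched and wraps the three plain-text blocks, including the multi-line one, in `<p>` elements. Lines that contain only spaces are not treated as blank here.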

score 1 · Answer 1 · answered Jun 07 '11 at 12:30

This is a bit verbose, but it should be solid. It uses DOMDocument rather than regex:

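$dom = new DOMDocument;
// loadXML() needs well-formed markup, so wrap the content in a single root element.
$dom->loadXML('<root>' . $yourContent . '</root>');
$xpath = new DOMXPath($dom);

// Select only the text nodes directly under the root, i.e. the unwrapped text.
$nodes = $xpath->query('/root/text()');

// Replace a text node with a <p> element that contains it.
function wrapnode ($node) {
    $p = $node->ownerDocument->createElement('p');
    $node->parentNode->replaceChild($p, $node);
    $p->appendChild($node);
}

foreach ($nodes as $node) {
    if ($node->nodeType === XML_TEXT_NODE) {
        $node->nodeValue = trim($node->nodeValue);

        // Split at each blank line and wrap the part before it.
        while (($location = strpos($node->nodeValue, "\n\n")) !== false) {
            $newnode = $node->splitText($location);
            wrapnode($node);

            $node = $newnode;
            $node->nodeValue = trim($node->nodeValue);
        }

        // Wrap whatever text is left after the last split.
        wrapnode($node);
    }
}

echo $dom->saveXML();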

lonesomeday


`$node->nodeValue = trim($node->nodeValue);` will make sure that `"\n\n"` is never found at the end of it, which renders most of the code useless, if I have spotted that right. – hakre Jun 07 '11 at 21:44
@hakre Yes, the `\n\n` are removed by this code. They are ignored in rendered HTML anyway. This code breaks on `\n\n` in the *middle* of strings. – lonesomeday Jun 07 '11 at 21:47
but doesn't that mean that simple lines of text will be converted into paragraphs as well? – hakre Jun 07 '11 at 21:50
What do you mean by "simple lines of text"? – lonesomeday Jun 07 '11 at 21:58
Those w/o "\n" at the beginning or end. (for the example data given by the OP it works.) – hakre Jun 07 '11 at 22:01
@hakre Could you provide some data where it wouldn't work? I'm afraid I don't understand your point. – lonesomeday Jun 07 '11 at 22:06
Sure, take this and check the output: `$yourContent = "hello world\n\n
lol
test";`. It will wrap test in a p even though test has no line breaks around it. Typically I would say double line break = p. – hakre Jun 07 '11 at 22:09
@hakre *shrug* That seems like a good thing to me. All unwrapped text is wrapped in `p` elements. I may have misinterpreted the question, but it is fairly vague. – lonesomeday Jun 07 '11 at 22:11
I think the question is not precise, so nobody could really say. I'll run a test converting the HTML 2 into domdocument (my suggestion below) to see how it performs. – hakre Jun 07 '11 at 22:26
Works, added it to my answer. – hakre Jun 07 '11 at 22:42

hakre · Answer 2 · 2011-06-07T22:40:55.717

As far as valid HTML 2.0 is concerned, the closing </p> tag is optional, so <p> does not need to come in pairs. Creating HTML from the input with an additional paragraph at every double line break is therefore very simple:

$html = str_replace("\n\n", '<p>', $html);

Keep in mind that this solution is very specific to the input and the output, so it might solve only part of the scenario in your question. However, I could not get enough information from your question to give a better answer.
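To make the limits of this replacement concrete, here is a small sketch that applies it to the sample text from the question. The variable `$yourContent` and its value are assumptions taken from that sample, not part of this answer.

```php
<?php
// Sample input copied from the question (an assumption for illustration).
$yourContent = "hello world\n\n<h2>lol</h2>\n\nlorem ipsum\ndolor sit\namet\n\nconsetetur";

// Every blank line becomes an opening <p>; nothing is ever closed.
$html = str_replace("\n\n", '<p>', $yourContent);

echo $html, "\n";
// hello world<p><h2>lol</h2><p>lorem ipsum
// dolor sit
// amet<p>consetetur
```

Note that the first block gets no `<p>` at all, and a `<p>` also lands directly in front of the existing `<h2>`, so the result relies on the parser's handling of omitted tags.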

As far as HTML 4.01 is concerned, it can easily be generated from that result:

$html = str_replace("\n\n", "<p>", $yourContent);
$dom = new DOMDocument;
$dom->loadHTML($html);
echo $dom->saveHTML();

DOMDocument can convert the HTML 2 into HTML 4.01 and will add the required elements such as the doctype, html and body. Only the head and title are missing.

Supported by every browser out there. Works like a charm. ;) — hakre, Jun 07 '11 at 12:29

score 0 · Answer 3 · answered Jun 07 '11 at 12:27

This works in Java:

input.replaceAll("(?<=\\n\\n)(?=\\w)", "<p>").replaceAll("(?<=\\w)(?=\\n\\n)", "</p>");

However, it's a bit brittle: it does two independent replacements, so the opening and closing tags it inserts might not pair up.
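Since the question is about PHP, the same two lookaround replacements can be ported to `preg_replace()`. The following is a sketch of such a port, not code from this answer, and the sample input is an assumption taken from the question. It also shows the brittleness in practice.

```php
<?php
// Sample input copied from the question (an assumption for illustration).
$input = "hello world\n\n<h2>lol</h2>\n\nlorem ipsum\ndolor sit\namet\n\nconsetetur";

// 1) Insert <p> at every position preceded by a blank line and followed by a word character.
$output = preg_replace('/(?<=\n\n)(?=\w)/', '<p>', $input);

// 2) Insert </p> at every position preceded by a word character and followed by a blank line.
$output = preg_replace('/(?<=\w)(?=\n\n)/', '</p>', $output);

echo $output, "\n";
// hello world</p>
//
// <h2>lol</h2>
//
// <p>lorem ipsum
// dolor sit
// amet</p>
//
// <p>consetetur
```

The first block receives a closing tag without an opening one, and the last block the reverse, because neither is surrounded by blank lines on both sides. Padding the input with `"\n\n"` at both ends before the replacements is one possible workaround.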

Bohemian
