negative lookbehind stopping at quantified whitespace?

Question

I need to insert <p> tags to wrap the content of each list element in an HTML fragment. This must not create nested paragraphs, which is why I want to use lookahead/lookbehind assertions to detect whether the content is already enclosed in a paragraph tag.

So far, I've come up with the following code.

This example uses a negative lookbehind assertion to match each </li> closing tag which is not preceded by a </p> closing tag and arbitrary whitespace:

$html = <<<EOF
<ul>
        <li>foo</li>
        <li><p>fooooo</p></li>
        <li class="bar"><p class="xy">fooooo</p></li>
        <li>   <p>   fooooo   </p>   </li>
</ul>
EOF;
$html = preg_replace('@(<li[^>]*>)(?!\s*<p)@i', '\1<p>', $html);
$html = preg_replace("@(?<!</p>)(\s*</li>)@i", '</p>\1', $html);
echo $html, PHP_EOL;

which to my surprise results in the following output:

<ul>
    <li><p>foo</p></li>
    <li><p>fooooo</p></li>
    <li class="bar"><p class="xy">fooooo</p></li>
    <li>   <p>   fooooo   </p> </p>  </li>
</ul>

The insertion of the opening tag works as expected, but note the additional </p> tag inserted in the last list element!

Can somebody explain why the whitespace (\s*) is totally ignored in the regex when a negative lookbehind assertion is used?

And even more important: what else can I try to achieve this goal?

When the first whitespace character is tested, the match fails; the regex engine then tries the next whitespace character and succeeds. — Casimir et Hippolyte, Oct 28 '13 at 22:35
About handling/parsing HTML with regexps: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Lachezar, Oct 28 '13 at 22:36
@CasimiretHippolyte The result is the same if I reduce the whitespace to a single space. — Kaii, Oct 28 '13 at 22:37
@Lucho That's the link I usually post too, but I thought this task was so easy that a regex would suffice. Also, I would like to understand what happens here. I'm the curious kind and want to know why the regex isn't working as expected. — Kaii, Oct 28 '13 at 22:38
@Kaii: That's normal. Since `\s*` allows zero whitespace characters, the pattern succeeds at the `<`. — Casimir et Hippolyte, Oct 28 '13 at 22:39
@CasimiretHippolyte Nice catch. Confirmed: I tested it by replacing `\s*` with `\s+`. But still, how can I change the regex to do what I expect? — Kaii, Oct 28 '13 at 22:41

score 2 · Accepted Answer · edited May 23 '17 at 11:43

Because the regex is not anchored in any way, it is free to be as loose as it likes.

In this case, let's look at how your string can be broken down. The square brackets indicate the attempted match.

... </p>[   </li>] // Fails, lookbehind assertion denies match
... </p> [  </li>] // Succeeds, lookbehind sees a space, not </p>

So you see the match succeeds simply by matching one less space, which is why you see a space between the two </p> tags in the result.
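To see the shifted match directly, here is a minimal sketch that runs the question's closing-tag pattern against just the tail of the fourth list item and reports where the match begins. The variable names are illustrative and not from the original post.

```php
<?php
// The tail of the fourth list item from the question: </p>, three spaces, </li>.
$subject = '</p>   </li>';

// Same pattern as in the question; PREG_OFFSET_CAPTURE also reports where the match starts.
preg_match('@(?<!</p>)(\s*</li>)@i', $subject, $m, PREG_OFFSET_CAPTURE);

$matchedText = $m[0][0]; // the text that matched
$startOffset = $m[0][1]; // byte offset of the match within $subject

// Offset 4 is the first space, directly after "</p>": the lookbehind rejects it.
// Offset 5 is the second space: the four characters before it are "/p> ",
// which is not "</p>", so the lookbehind passes.
var_dump($startOffset);  // int(5)
var_dump($matchedText);  // string(7) "  </li>"
```

The match covers only two of the three spaces, so the replacement puts the extra </p> after the first space.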

There's no easy fix for this in Regex. THE PONY HE COMES. So instead try using a parser.

$dom = new DOMDocument();
$dom->loadHTML($html); // the fragment gets wrapped in <html><body>
$lis = $dom->getElementsByTagName('li');
foreach($lis as $li) {
    // only wrap list items that do not contain a <p> yet
    if( !$li->getElementsByTagName('p')->length) {
        $p = $dom->createElement("p");
        // move every child of the <li> into the new <p>
        while($li->firstChild) $p->appendChild($li->firstChild);
        $li->appendChild($p);
    }
}
$output = $dom->saveHTML($dom->getElementsByTagName('body')->item(0));
$output = substr($output,strlen("<body>"),-strlen("</body>")); // strip body tag
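If a regex has to be used anyway, one possible workaround (a sketch that is not part of the original answer) is to add a second lookbehind, (?<!\s), so that the match can only begin at the start of the whitespace run. The engine can then no longer skip a space to get past (?<!</p>). This assumes markup shaped like the sample in the question and is as fragile as any regex-based HTML handling, so a parser is still the more robust route.

```php
<?php
$html = <<<EOF
<ul>
    <li>foo</li>
    <li><p>fooooo</p></li>
    <li class="bar"><p class="xy">fooooo</p></li>
    <li>   <p>   fooooo   </p>   </li>
</ul>
EOF;

// Opening tags: unchanged from the question.
$html = preg_replace('@(<li[^>]*>)(?!\s*<p)@i', '\1<p>', $html);

// Closing tags: (?<!\s) pins the match to the start of the whitespace run,
// so the only candidate position is the one directly after the preceding content,
// and that position is exactly where (?<!</p>) is meant to be checked.
$html = preg_replace('@(?<!</p>)(?<!\s)(\s*</li>)@i', '</p>\1', $html);

echo $html, PHP_EOL;
```

With the sample input, only the first list item gains a paragraph; the three items that already contain one are left alone.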

Nicely visualized explanation. Thank you very much! I will keep the anchoring in mind for the future. — Kaii, Oct 28 '13 at 22:42
FWIW, Perl can do this; it lifted the restriction on the width of lookbehinds. I wouldn't recommend it though :) — Eevee, Oct 28 '13 at 22:45

score 1 · Answer 2 · edited May 23 '17 at 12:29

You have this:

</p>   </li>

And your regex doesn't match here:

</p>   </li>
    ^

because there's a </p> immediately preceding. But it DOES match here:

</p>   </li>
     ^

because the preceding text is not "</p>", but "/p> " (ending in a space).

You want an HTML parser. PHP comes with several, but I'm not much of a PHP dev so I can't recommend any in particular. See this question for some recommendations.

answered Oct 28 '13 at 22:38 · Eevee

score 0 · Answer 3 · edited Oct 29 '13 at 00:03

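This might help, at least for the sample input. Be aware that [^</li>] is a character class (any character other than <, /, l, i, or >), not a negated string, so content containing the letters l or i, such as <li>list</li>, will not be wrapped correctly.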

$html = preg_replace('@(<li[^>]*>)([^</li>]+)(?!\s*<p)@i', '$1<p>$2</p>', $html);

answered Oct 28 '13 at 23:33 · StackSlave

Actually, I prefer to use the DOM parser to solve this now because it gives me much more flexibility and better code readability. I was just curious about what I was missing in how the regex works behind the scenes. The accepted answer explains this very well. – Kaii Oct 29 '13 at 00:01
