
I need to insert <p> tags around the content of each list element in an HTML fragment. This must not create nested paragraphs, which is why I want to use lookahead/lookbehind assertions to detect whether the content is already enclosed in a paragraph tag.

So far, I've come up with the following code.

This example uses a negative lookbehind assertion to match each </li> closing tag that is not preceded by a </p> closing tag and arbitrary whitespace:

$html = <<<EOF
<ul>
        <li>foo</li>
        <li><p>fooooo</p></li>
        <li class="bar"><p class="xy">fooooo</p></li>
        <li>   <p>   fooooo   </p>   </li>
</ul>
EOF;
$html = preg_replace('@(<li[^>]*>)(?!\s*<p)@i', '\1<p>', $html);
$html = preg_replace("@(?<!</p>)(\s*</li>)@i", '</p>\1', $html);
echo $html, PHP_EOL;

which to my surprise results in the following output:

<ul>
    <li><p>foo</p></li>
    <li><p>fooooo</p></li>
    <li class="bar"><p class="xy">fooooo</p></li>
    <li>   <p>   fooooo   </p> </p>  </li>
</ul>

The insertion of the opening tag works as expected, but note the additional </p> tag inserted in the last list element!

Can somebody explain why the whitespace (\s*) seems to be totally ignored in the regex when a negative lookbehind assertion is used?

And even more important: what else can I try to achieve the mentioned goal?

Kaii
  • perhaps consider using an HTML parser instead. – Eevee Oct 28 '13 at 22:34
  • When the first white space is tested it fails, then the regex engine will test the next white space and succeed. – Casimir et Hippolyte Oct 28 '13 at 22:35
  • About the handling/parsing html with regexps - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Lachezar Oct 28 '13 at 22:36
  • @CasimiretHippolyte the result is the same if I reduce the whitespace to a single character. – Kaii Oct 28 '13 at 22:37
  • @Lucho that's the link I usually post too, but I thought this task was so easy that a regex would suffice. Also, I'm the curious kind of guy and would like to understand why the regex isn't working as expected. – Kaii Oct 28 '13 at 22:38
  • @Kaii: That's normal. Since `\s*` also allows zero whitespace characters, the pattern succeeds on the `<` – Casimir et Hippolyte Oct 28 '13 at 22:39
  • @CasimiretHippolyte nice catch. Confirmed - tested it by replacing `\s*` with `\s+`. But still, how can I change the regex to do what I expect? – Kaii Oct 28 '13 at 22:41

3 Answers

2

Because the regex is not anchored in any way, the engine is free to start the match at any position, including in the middle of a run of whitespace.

In this case, let's look at how your string can be broken down. The square brackets indicate the attempted match.

... </p>[   </li>] // Fails, lookbehind assertion denies match
... </p> [  </li>] // Succeeds, lookbehind sees a space, not </p>

So you see the match succeeds simply by matching one less space, which is why you see a space between the two </p> in the result.
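
To check this yourself, here is a minimal sketch that asks preg_match() where the match actually begins. The variable names ($fragment, $closeP, $matchStart) are my own and not from the question; the pattern is the one from the question.

```php
<?php
// The problematic list item from the question.
$fragment = "<li>   <p>   fooooo   </p>   </li>";
$pattern  = '@(?<!</p>)(\s*</li>)@i';

// PREG_OFFSET_CAPTURE returns each capture as [text, byte offset].
preg_match($pattern, $fragment, $m, PREG_OFFSET_CAPTURE);

// Offset of the first character after the closing </p>.
$closeP     = strrpos($fragment, '</p>') + strlen('</p>');
$matchStart = $m[1][1];

// The match starts one character past </p>, where the lookbehind
// sees "/p> " instead of "</p>".
echo $matchStart - $closeP, PHP_EOL;   // 1
var_export($m[1][0]);                  // '  </li>' (two spaces, not three)
echo PHP_EOL;
```

The captured text contains only two of the three spaces, which matches the stray space between the two </p> tags in the output.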

There's no easy fix for this in Regex. THE PONY HE COMES. So instead try using a parser.

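// Parse the fragment; loadHTML() wraps it in <html><body>...</body></html>.
$dom = new DOMDocument();
$dom->loadHTML($html);
$lis = $dom->getElementsByTagName('li');
foreach($lis as $li) {
    // Only wrap list items that do not already contain a <p> somewhere inside.
    if( !$li->getElementsByTagName('p')->length) {
        $p = $dom->createElement("p");
        // Move every child of the <li> into the new <p>, then attach the <p>.
        while($li->firstChild) $p->appendChild($li->firstChild);
        $li->appendChild($p);
    }
}
// Serialize only the <body> element, then cut off the wrapper tags.
$output = $dom->saveHTML($dom->getElementsByTagName('body')->item(0));
$output = substr($output,strlen("<body>"),-strlen("</body>")); // strip body tag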
Niet the Dark Absol
  • nice visualized explanation. thank you very much! will keep the anchoring in mind for the future. – Kaii Oct 28 '13 at 22:42
  • fwiw Perl can do this; it lifted the restriction on the width of lookbehinds. i wouldn't recommend it though :) – Eevee Oct 28 '13 at 22:45
1

You have this:

</p>   </li>

And your regex doesn't match here:

</p>   </li>
    ^

because there's a </p> immediately preceding. But it DOES match here:

</p>   </li>
     ^

because the preceding text is not </p>, but "/p> " (the last three characters of the tag plus one space).

You want an HTML parser. PHP comes with several, but I'm not much of a PHP dev so I can't recommend any in particular. See this question for some recommendations.

Eevee
0

This might help, as long as each list item contains only plain text (no nested tags).

$html = preg_replace('@(<li[^>]*>)([^<]+)(?!\s*<p)@i', '$1<p>$2</p>', $html);
StackSlave
  • Actually I prefer to use the DOM parser to solve this now, because it gives me much more flexibility and better code readability. I was just curious what was happening in the background that I was missing. The accepted answer explains this very well. – Kaii Oct 29 '13 at 00:01