
I'm trying to strip one particular div (and its inner contents) out of a block of content, but it isn't quite working.

Regex:

/<div class="greybackground_desktop".*>(.*)<\/div>/s

Preg_replace:

preg_replace($pattern, "", $holder, -1, $count );

The regex does strip out my div. However, if any closing div tags follow it later in the content, it also strips out everything up to and including the last of them.

e.g.

<p>some random text</p>

<div class="greybackground_desktop" style="background-color:#EFEFEF;">
<!-- /49527960/CSF_Article_Middle -->
<div style="padding-bottom:10px; padding-top: 10px; text-align:center;" id='div-gpt-ad-1441883689230-0'>
<script type='text/javascript'>
googletag.cmd.push(function() { googletag.display('div-gpt-ad-1441883689230-0'); });
</script>
</div>
</div>

<p>some more text</p>

<div><p>example of content that will be incorrectly removed</p></div>

<p>Text that follows</p>

This will result in the following output:

some random text

Text that follows

What I want to see is:

some random text

some more text

example of content that will be incorrectly removed

Text that follows

Any ideas?

Sami.C

2 Answers


Use a parser like DOMDocument instead. Consider this code:

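<?php
$dom = new DOMDocument();
// loadHTML() parses the markup and wraps fragments in <html><body>
$dom->loadHTML($your_html_here);

$xpath = new DOMXPath($dom);

// select every div whose class attribute is exactly "greybackground_desktop"
foreach ($xpath->query("//div[@class='greybackground_desktop']") as $div) {
    // removing a node also removes everything nested inside it
    $div->parentNode->removeChild($div);
}

echo $dom->saveHTML();
?>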

The script loads your HTML, looks for div elements with the class greybackground_desktop and removes them, including everything nested inside. A demo can be found on ideone.com.
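As a rough sketch of how the same idea could be wrapped up for reuse: the helper below is hypothetical (the name removeDivsByClass and the shortened sample markup are assumptions, not part of the answer). It differs from the snippet above in two ways: it matches the class as a whole token, so a div with several classes is still found, and it returns only the fragment instead of the full document that loadHTML() builds around it.

```php
<?php
// Hypothetical helper, not part of the original answer: removes every <div>
// whose class attribute contains $class as a whole token.
function removeDivsByClass(string $html, string $class): string
{
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument();

    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Token match, so class="a greybackground_desktop b" is found too.
    // $class is assumed to be a plain class name without quotes.
    $xpath = new DOMXPath($dom);
    $query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";

    foreach ($xpath->query($query) as $div) {
        // Removing the node also removes everything nested inside it.
        $div->parentNode->removeChild($div);
    }

    // loadHTML() wraps the fragment in <html><body>; return only the body's children.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

// Shortened version of the markup from the question (an assumption of this sketch).
$sample = <<<'HTML'
<p>some random text</p>
<div class="greybackground_desktop" style="background-color:#EFEFEF;">
<div id="ad-slot"><script>/* ad code */</script></div>
</div>
<p>some more text</p>
<div><p>example of content that will be incorrectly removed</p></div>
<p>Text that follows</p>
HTML;

$cleaned = removeDivsByClass($sample, 'greybackground_desktop');
echo $cleaned;
```

Because the removal works on the parsed tree, it does not matter how deeply the ad markup is nested or how the closing tags are spaced.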

Jan

The correct way to do this is to use an HTML parser like DOMDocument. Here's an example:

$holder = <<< LOL
<p>some random text</p>
<div class="greybackground_desktop" style="background-color:#EFEFEF;">
<!-- /49527960/CSF_Article_Middle -->
<div style="padding-bottom:10px; padding-top: 10px; text-align:center;" id='div-gpt-ad-1441883689230-0'>
<script type='text/javascript'>
googletag.cmd.push(function() { googletag.display('div-gpt-ad-1441883689230-0'); });
</script>
</div>
</div>
<p>some more text</p>
<div><p>example of content that will be incorrectly removed</p></div>
<p>Text that follows</p>
LOL;
$dom = new DOMDocument();
// avoid leftover whitespace after removing the node
$dom->preserveWhiteSpace = false;
// parse the HTML into a DOM tree
$dom->loadHTML($holder);
// get the first div in the document (here, the greybackground_desktop one)
if ($div = $dom->getElementsByTagName('div')->item(0)) {
    // remove the node by telling its parent to remove the child
    $div->parentNode->removeChild($div);
    // print the modified document
    echo $dom->saveHTML();
}

Ideone DOMDocument Demo



If you really want to use a regex, use a lazy quantifier (.*?) instead of the greedy .*, e.g.:

$result = preg_replace('%<div class="greybackground_desktop".*?</div>\s+</div>%si', '', $holder);

Ideone Demo


Read more about regex repetition, specifically "Laziness Instead of Greediness"

http://www.regular-expressions.info/repeat.html
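To make the difference concrete, here is a small sketch comparing the greedy and lazy quantifiers. The sample string is a shortened, made-up variant of the question's markup, and the variable names are only for illustration. It also runs the pattern from this answer on input where the two closing tags are not separated by whitespace, to show how much the regex depends on the exact layout.

```php
<?php
// Shortened sample modelled on the question's markup (an assumption of this sketch).
// Note that the two closing tags of the ad block are adjacent: </div></div>
$html = '<p>intro</p>'
      . '<div class="greybackground_desktop"><div>ad</div></div>'
      . '<p>some more text</p>'
      . '<div><p>keep me</p></div>';

// Greedy: .* runs to the end of the input and backtracks to the LAST </div>,
// so the unrelated div at the end is swallowed as well.
$greedy = preg_replace('%<div class="greybackground_desktop".*</div>%s', '', $html);

// Lazy: .*? stops at the FIRST </div>, which is the inner div's closing tag,
// so the outer closing tag is left behind.
$lazy = preg_replace('%<div class="greybackground_desktop".*?</div>%s', '', $html);

// The pattern from this answer requires whitespace between the two closing
// tags (\s+). This sample has none, so nothing matches and nothing is removed.
$fragile = preg_replace('%<div class="greybackground_desktop".*?</div>\s+</div>%si', '', $html);

echo $greedy, PHP_EOL;
echo $lazy, PHP_EOL;
echo $fragile, PHP_EOL;
```

Neither quantifier can count nesting levels, which is why the lazy pattern has to spell out both closing tags and why a parser is the safer choice.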


Pedro Lobito
  • This is **very** (!) error-prone, look at this [regex101.com example](https://regex101.com/r/dK7jY6/1) where I only removed one newline (obviously, the `HTML` is still valid). – Jan Apr 29 '16 at 23:44
  • The first part of my answer works with the example provided by the OP, regex is not the way to go with html, we all know that... that's why I've posted the second part of my answer. – Pedro Lobito Apr 29 '16 at 23:48