Stripping one div out of content with Regex

Question

I'm trying to strip one particular div (and its inner contents) out of a block of content, but it isn't quite working.

Regex:

/<div class="greybackground_desktop".*>(.*)<\/div>/s

Preg_replace:

preg_replace($pattern, "", $holder, -1, $count );

The regex does strip out my div. However, if any other closing div tags follow it, the match runs on to the last one, so everything in between (including unrelated content) is removed too.

e.g.

<p>some random text</p>

<div class="greybackground_desktop" style="background-color:#EFEFEF;">
<!-- /49527960/CSF_Article_Middle -->
<div style="padding-bottom:10px; padding-top: 10px; text-align:center;" id='div-gpt-ad-1441883689230-0'>
<script type='text/javascript'>
googletag.cmd.push(function() { googletag.display('div-gpt-ad-1441883689230-0'); });
</script>
</div>
</div>

<p>some more text</p>

<div><p>example of content that will be incorrectly removed</p></div>

<p>Text that follows</p>

This will result in the following output:

some random text

Text that follows

What I am wanting to see is:

some random text

some more text

example of content that will be incorrectly removed

Text that follows

Any ideas?

@Sami C please ensure you know how to accept correct answers — David, Apr 30 '16 at 09:51

Jan · Accepted Answer · 2016-04-30T06:17:43.787

3

Use a parser like DOMDocument instead. Consider this code:

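<?php
$dom = new DOMDocument();
// parse the markup; a bare fragment gets wrapped in <html><body>
$dom->loadHTML($your_html_here);

$xpath = new DOMXPath($dom);

// select every div whose class attribute is exactly "greybackground_desktop"
foreach ($xpath->query("//div[@class='greybackground_desktop']") as $div) {
    // removing the node also removes everything nested inside it
    $div->parentNode->removeChild($div);
}

echo $dom->saveHTML();
?>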

The script loads your HTML, finds every div whose class attribute is exactly greybackground_desktop, and removes each one along with its contents. A demo can be found on ideone.com.
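The XPath above matches only when the class attribute is exactly greybackground_desktop, and calling saveHTML() on the whole document also prints the doctype and html/body wrapper that loadHTML() adds around a fragment. Here is a minimal sketch that loosens both, assuming the input is a fragment and the class name contains no quote characters. removeDivsByClass is a hypothetical helper name, not part of the answer:

```php
<?php
// Hypothetical helper: strip every <div> carrying $class (even when it is one
// of several classes) and return the remaining markup without <html><body>.
function removeDivsByClass(string $html, string $class): string
{
    // loadHTML() rejects an empty string, so handle that case up front
    if (trim($html) === '') {
        return '';
    }

    // collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // pad the class attribute with spaces so "foo" does not match "foobar"
    $xpath = new DOMXPath($dom);
    $query = sprintf(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    foreach ($xpath->query($query) as $div) {
        $div->parentNode->removeChild($div);
    }

    // serialise only the children of <body>, i.e. the original fragment
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$holder = '<p>some random text</p>'
    . '<div class="greybackground_desktop wide"><div id="ad"><script>var x = 1;</script></div></div>'
    . '<p>some more text</p>'
    . '<div><p>example of content that must survive</p></div>';

echo removeDivsByClass($holder, 'greybackground_desktop');
```

Note that loadHTML() does not assume UTF-8 when the markup declares no charset, so non-ASCII content may need an explicit encoding hint before parsing.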

edited Apr 30 '16 at 06:17

answered Apr 29 '16 at 23:40


Nice answer, but you should change `echo $dom->saveXML();` for `echo $dom->saveHTML();` ;) – Pedro Lobito Apr 29 '16 at 23:52

@PedroLobito: Right you are, muito obrigado :) – Jan Apr 30 '16 at 06:18

Pedro Lobito · Answer 2 · 2016-04-30T00:17:10.163

The correct way to do this is to use an HTML parser like DOMDocument. Here's an example:

$holder = <<< LOL
<p>some random text</p>
<div class="greybackground_desktop" style="background-color:#EFEFEF;">
<!-- /49527960/CSF_Article_Middle -->
<div style="padding-bottom:10px; padding-top: 10px; text-align:center;" id='div-gpt-ad-1441883689230-0'>
<script type='text/javascript'>
googletag.cmd.push(function() { googletag.display('div-gpt-ad-1441883689230-0'); });
</script>
</div>
</div>
<p>some more text</p>
<div><p>example of content that will be incorrectly removed</p></div>
<p>Text that follows</p>
LOL;
$dom = new DOMDocument();
//avoid the whitespace after removing the node
$dom->preserveWhiteSpace = false;
//parse html dom elements
$dom->loadHTML($holder);
//get the first div in the document (in this sample, the greybackground_desktop wrapper)
if($div = $dom->getElementsByTagName('div')->item(0)) {
   //remove the node by telling the parent node to remove the child
   $div->parentNode->removeChild($div);
   //save the new document
   echo $dom->saveHTML();
}

Ideone DOMDocument Demo

If you really want to use a regex, use a lazy quantifier (.*?) instead of a greedy one (.*), e.g.:

$result = preg_replace('%<div class="greybackground_desktop".*?</div>\s+</div>%si', '', $holder);

Ideone Demo

Read more about regex repetition, specifically "Laziness Instead of Greediness"

http://www.regular-expressions.info/repeat.html
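To make the greedy/lazy difference concrete, here is a small sketch on a simplified sample. The markup and the shortened class name "ad" are assumptions made for illustration, not the OP's content. It shows the greedy pattern running to the last closing div, a plain lazy pattern stopping at the nested div's closing tag, and the `\s+</div>` variant failing to match once the whitespace between the two closing tags is gone:

```php
<?php
// Simplified sample: the target div wraps a nested div, and an unrelated div follows.
$html = '<p>a</p><div class="ad"><div>inner</div></div><p>b</p><div>keep</div><p>c</p>';

// Greedy (shape of the question's pattern): the match runs to the LAST </div>.
$greedyResult = preg_replace('/<div class="ad".*>(.*)<\/div>/s', '', $html);
echo $greedyResult, "\n"; // <p>a</p><p>c</p>

// Lazy: the match stops at the FIRST </div>, which closes the nested div,
// so the outer closing tag is left behind.
$lazyResult = preg_replace('/<div class="ad".*?<\/div>/s', '', $html);
echo $lazyResult, "\n"; // <p>a</p></div><p>b</p><div>keep</div><p>c</p>

// Lazy plus "\s+</div>": requires whitespace between the two closing tags.
// This sample has none, so nothing matches and the input comes back untouched.
$strictResult = preg_replace('%<div class="ad".*?</div>\s+</div>%si', '', $html);
echo $strictResult === $html ? "unchanged\n" : "changed\n"; // unchanged
```

None of the three patterns tracks nesting depth, which is why a regex only works here when the exact layout of the markup is known in advance.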

This is **very** (!) error-prone, look at this [regex101.com example](https://regex101.com/r/dK7jY6/1) where I only removed one newline (obviously, the `HTML` is still valid). — Jan, Apr 29 '16 at 23:44
The first part of my answer works with the example provided by the OP, regex is not the way to go with html, we all know that... that's why I've posted the second part of my answer. — Pedro Lobito, Apr 29 '16 at 23:48
