So I had this HTML:

<html>
<head>...</head>
<body>
(some js and css)
    <div class="no_remove">(content)</div>
    <div class="no_remove">(content that i didn't want to remove)
        <div class="remove">
            <span>(content)</span>
            <span>(content)</span>
            <span>(content)</span>
            <div class="other1">(content)</div>
            <div class="other2">(content)</div>
            <div class="other3">(content)</div>
        </div>
    </div>
</body>
</html>

and this PHP:

$text = file_get_contents($link);
$dom = new DOMDocument();
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="no_remove"]');
$result = $dom->saveXML($div->item(1));
$result2 = preg_replace('#<div class="remove">(.*?)</div>#', ' ', $result);
echo $result2;

DOMXPath did its job perfectly, but preg_replace did not remove the div with class "remove".
Can a regex master (or anyone else who knows) enlighten me?

Sorry for my bad English.

sukri
  • Somewhat related: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Isaac May 18 '18 at 02:40

2 Answers

2

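You likely need to specify the s (DOTALL) modifier, which lets . match newlines as well. (The multi-line modifier is m, and it only changes how ^ and $ behave.)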

$result2 = preg_replace('#<div class="remove">(.*?)</div>#s', ' ', $result);

Or you can use [\s\S] instead of . to match across multiple lines. So,

$result2 = preg_replace('#<div class="remove">([\s\S]*?)</div>#', ' ', $result);
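A minimal sketch of the difference, assuming a made-up snippet in which the div's content spans several lines ($sample is hypothetical and not the asker's real page):

```php
<?php
// Hypothetical input: the content of the target div contains newlines.
$sample = "<p>keep</p><div class=\"remove\">\n  <span>x</span>\n</div>";

// Without the s modifier, . cannot cross a newline, so nothing matches
// and preg_replace returns the input unchanged.
$without = preg_replace('#<div class="remove">(.*?)</div>#', ' ', $sample);

// With the s modifier, . also matches newlines and the whole div is replaced.
$with = preg_replace('#<div class="remove">(.*?)</div>#s', ' ', $sample);

var_dump($without === $sample); // bool(true)
var_dump($with === '<p>keep</p> '); // bool(true)
```

This only holds while the target div contains no nested div; the lazy match always ends at the first closing tag it finds.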

Also, I would normally use \s+ instead of a literal space, in case the HTML has more than one whitespace character between the tag name and the attribute:

$result2 = preg_replace('#<div\s+class="remove">([\s\S]*?)</div>#', ' ', $result);

You can also try something like this to handle multiple attributes and other types of quotes:

$result2 = preg_replace('#<div\b[^>]+\bclass\s*=\s*[\'\"]remove[\'\"][^>]*>([\s\S]*?)</div>#', ' ', $result);

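*QUICK EDIT: I added \b to mark a word boundary, so an attribute whose name merely ends in "class" (for example subclass) won't get matched instead of the class attribute. Note that a hyphenated name such as data-class would still match, because - is not a word character.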

D.B.
  • Wow, this removes some of the contents of div class="remove", but there are other divs inside it that are still there. – sukri May 18 '18 at 02:53
  • I did not totally follow you. Did you get it working? Also, you can make the regex even more adaptive in case there are other attributes in the tag, like `$result2 = preg_replace('#<div[^>]+class\s*=\s*[\'\"]remove[\'\"][^>]*>([\s\S]*?)</div>#', ' ', $result);` – D.B. May 18 '18 at 02:57
  • Sorry for my bad English. Your regex is better than mine, but not all of the contents are removed; some are still left. – sukri May 18 '18 at 03:01
  • Ok, I made some updates. If the new regex still doesn't match, can you give an example of something it missed? – D.B. May 18 '18 at 03:03
  • Oh wait, I see you modified your example. In this case, the closing `</div>` that gets matched would not be the correct one, because the regex cannot know which `</div>` actually closes the outer div. This is why the DOM functions are a better fit for modifying HTML. – D.B. May 18 '18 at 03:05
  • Yes, sorry, I edited it because there is more than one div inside. So what should I do? – sukri May 18 '18 at 03:13
  • I just used more preg_replace calls to get it done! Thanks ^^ – sukri May 18 '18 at 03:25
  • Ok, interesting approach. Thanks for the points. Quick note: regex should never be trusted for this. Depending on the input, it can give unexpected results. My rule is to use it only in places where a failure does not create a big problem. For your example above, it might be worth mentioning that you can also remove the element and its content via JavaScript, in case that option suits your needs. – D.B. May 18 '18 at 03:31
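
To illustrate the point from the comments about the wrong closing tag, here is a small sketch. The input string is an assumption modelled on the question's markup, reduced to one line:

```php
<?php
// Hypothetical input: a div nested inside the div we want to remove.
$sample = '<div class="remove"><div class="other1">a</div><span>b</span></div>';

// The lazy quantifier stops at the FIRST </div>, which closes "other1",
// not the outer "remove" div.
$out = preg_replace('#<div class="remove">([\s\S]*?)</div>#', ' ', $sample);

var_dump($out === ' <span>b</span></div>'); // bool(true): leftovers remain
```

The span and the outer closing tag survive, which is the "some contents are still left" symptom described above. Making the quantifier greedy does not help either, since it would then run to the last </div> in the whole string.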
2

Here is how you keep using the right tool: use DOMDocument/DOMXPath to remove the unwanted div based on its class name, and don't resort to regex.

Code:

$html = <<<HTML
<html>
<head>...</head>
<body>
(some js and css)
    <div class="no_remove">(content)</div>
    <div class="no_remove">(content that i didn't want to remove)
        <div class="remove">
            <span>(content)</span>
            <span>(content)</span>
            <span>(content)</span>
            <div class="other1">(content)</div>
            <div class="other2">(content)</div>
            <div class="other3">(content)</div>
        </div>
    </div>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // keep libxml parse warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// Remove every div whose class attribute is exactly "remove", children included
foreach ($xpath->query('//div[@class="remove"]') as $div) {
    $div->parentNode->removeChild($div);
}
echo $dom->saveHTML();

Output:

<html>
<head></head><p>...
</p><body>
(some js and css)
    <div class="no_remove">(content)</div>
    <div class="no_remove">(content that i didn't want to remove)

    </div>
</body>
</html>
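
A possible variation, sketched under two assumptions that are not in the original answer: the unwanted div may carry more than one class (here the made-up value "remove box"), and, as in the question, only the second no_remove div should be printed. The input string is hypothetical.

```php
<?php
// Hypothetical input: the target div has two classes, "remove" and "box".
$html = '<body><div class="no_remove">first</div>'
      . '<div class="no_remove">second<div class="remove box"><span>x</span>'
      . '<div class="other1">y</div></div></div></body>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Match "remove" as one whitespace-separated token of the class attribute,
// rather than requiring the attribute to equal "remove" exactly.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " remove ")]';
foreach ($xpath->query($query) as $div) {
    $div->parentNode->removeChild($div);
}

// As in the question, serialize only the second no_remove div.
$target = $xpath->query('//div[@class="no_remove"]')->item(1);
$result = $dom->saveHTML($target);
echo $result;
```

Because the node is removed from the tree, its nested divs and spans go with it, so no follow-up preg_replace is needed.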
mickmackusa