PHP regex remove some unwanted div

Question

I want to remove divs whose id or class contains the word comment or share (for example: <div id="comment">, <div class="header-comment">, <div id="comment-footer">, <div class="social-share">). I tried something like this:

preg_replace('/<div[^>]*(comment|share)[^>]*>(.*?)<\/div>/is', '', $htmls);

It does not work. How do I write a correct regex? Here is some test code; I want to remove the comment part and keep the content and footer:

$htmls = <<<EOT
<div id="content">
     Main content.
</div>
<div id="comment">
    <ul>
        <li class="comment">
            <div class="header-comment">
                Comment:
                <span class="date-comment">8/11/2012, 21:25</span>
            </div>
            <h4>Some Text</h4>
            <p class="test-comment">Blah~~ Blah~~ Blah~~</p>
            <div class="share">
                <div class="vote">
                    <a class="vota yes" title="Like">2</a>
                    <a class="vota no" title="Unlike">0</a>
                </div>
            </div>
        </li>
        <li class="comment">
            <div class="header-comment">
                Comment:
                <span class="date-comment">8/11/2012, 23:08</span>
            </div>
            <h4>Other Text</h4>
            <p class="test-comment">Blah~~ Blah~~ Blah~~</p>
            <div class="share">
                <div class="vote">
                    <a class="vota yes" title="Like">4</a>
                    <a class="vota no" title="Unlike">0</a>
                </div>
            </div>
        </li>     
     </ul>
</div>
<div id="footer">
     Footer content.
</div>
EOT;

$htmls = preg_replace('/<div[^>]*(comment|share)[^>]*>(.*?)<\/div>/is', '', $htmls);
echo $htmls;

[Beware of parsing HTML with regular expressions, the way Cthulhu wants you to.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — rid, Nov 09 '12 at 12:13
A baby seal gets horribly killed every time you try to parse HTML with regex. — moonwave99, Nov 09 '12 at 12:15
Html is not a regular language and therefore it is really difficult to use regular expressions to parse it. http://en.wikipedia.org/wiki/Regular_language — John Sobolewski, Nov 09 '12 at 12:16
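To make the comments above concrete, here is a small sketch of why the pattern from the question breaks on nested divs. The input string is a hypothetical, shortened stand-in for the test data, not the original markup. The lazy `(.*?)` stops at the first `</div>` it finds, which belongs to the inner div, so the outer div's remaining content and closing tag are left behind.

```php
<?php
// Hypothetical minimal input: a "comment" div that wraps one nested div.
$html = '<div id="comment"><div class="inner">x</div> tail</div><div id="footer">F</div>';

// Same pattern as in the question.
$out = preg_replace('/<div[^>]*(comment|share)[^>]*>(.*?)<\/div>/is', '', $html);

// The match ends at the inner </div>, so " tail</div>" is orphaned.
echo $out, "\n";
// prints: " tail</div><div id="footer">F</div>" (without the quotes)
```

A regular expression of this form has no way to count opening and closing tags, which is why the answers below turn to a DOM parser.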

score 2 · Answer 1 · answered Nov 09 '12 at 12:28

Consider using the DOMDocument functions to parse the HTML, then target the divs you don't want and remove them. This will be faster, easier to understand and maintain, and possibly quicker to write.

Martin Lyne

score 1 · Accepted Answer · answered Nov 09 '12 at 12:31

I think you should use DOMDocument. Try:

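$dom = new DOMDocument();
$dom->loadHTML($htmls);
// Exact id/class values to remove; partial names such as "header-comment" are not matched.
$remove = array("comment","share");
$removeList = array();
// Collect the matches first: removing nodes while iterating the live node list skips elements.
foreach ( $dom->getElementsByTagName("div") as $div ) {
    if (in_array($div->getAttribute("class"), $remove) || in_array($div->getAttribute("id"), $remove)) {
        $removeList[] = $div;
    }
}

// Detach every collected div from its parent.
foreach ( $removeList as $div ) {
    $div->parentNode->removeChild($div);
}

$dom->formatOutput = true;
// Escape the markup so the browser displays it as text.
echo "<pre>";
echo htmlentities($dom->saveHTML());
echo "</pre>";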

Baba

So if there are divs like `div.header-comment` and `div.social-share`, should I list them all in `$remove = array("comment","share","header-comment","social-share");`? It is tedious to list every one in an array. – fish man Nov 09 '12 at 12:51
Is it possible to use `strpos` instead of `in_array`, so that a div is removed whenever `comment` or `share` is found in its id or class? – fish man Nov 09 '12 at 13:03
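The `strpos` idea from the comment above could look roughly like the sketch below. This is not the answerer's code: the helper name `removeDivsContaining` and the sample markup are assumptions made for illustration. It keeps the collect-then-remove structure of the answer and only swaps the exact `in_array` comparison for a substring test.

```php
<?php
// Hypothetical helper (not from the answer): remove every <div> whose id or
// class contains any of the given words as a substring.
function removeDivsContaining(string $html, array $words): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $removeList = [];
    foreach ($dom->getElementsByTagName('div') as $div) {
        // Search id and class together; a missing attribute yields an empty string.
        $haystack = $div->getAttribute('id') . ' ' . $div->getAttribute('class');
        foreach ($words as $word) {
            if (strpos($haystack, $word) !== false) {
                $removeList[] = $div;
                break;
            }
        }
    }

    // Remove after the loop so the live node list is not modified while iterating.
    foreach ($removeList as $div) {
        if ($div->parentNode !== null) {
            $div->parentNode->removeChild($div);
        }
    }
    return $dom->saveHTML();
}

// Assumed sample input, modeled on the ids and classes named in the question.
$sample = '<div id="content">Main</div>'
        . '<div id="comment-footer"><div class="social-share">s</div>c</div>'
        . '<div id="footer">Footer</div>';
$clean = removeDivsContaining($sample, ['comment', 'share']);
echo $clean;
```

Two caveats with this approach: a substring test also matches unrelated names that happen to contain the word (for example a class like `shareholder`), and `saveHTML()` returns a full document, so the result is wrapped in doctype, `html`, and `body` elements.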

score 0 · Answer 3 · answered Nov 09 '12 at 12:15

How to do a right regex?

You do so by first identifying all the DIVs, extracting their id and class text, and then checking that text against your regular expression pattern with preg_match.

However, you can skip the regular expression part as well and just use XPath. That is more straightforward in your case.
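As a sketch of the XPath route mentioned above, the following uses PHP's `DOMXPath` with the XPath 1.0 `contains()` function. The query string and the sample markup are assumptions for illustration; the answer itself does not give a query.

```php
<?php
// Assumed sample input, modeled on the ids and classes named in the question.
$htmls = '<div id="content">Main</div>'
       . '<div id="comment"><div class="header-comment">c</div><div class="share">s</div></div>'
       . '<div id="footer">Footer</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($htmls);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select every div whose id or class contains "comment" or "share".
$query = '//div[contains(@id, "comment") or contains(@id, "share")'
       . ' or contains(@class, "comment") or contains(@class, "share")]';

// Copy the result to an array, then detach each matching div from its parent.
foreach (iterator_to_array($xpath->query($query)) as $div) {
    if ($div->parentNode !== null) {
        $div->parentNode->removeChild($div);
    }
}

$result = $dom->saveHTML();
echo $result;
```

The selection logic lives entirely in the query, so adding another word means extending the predicate rather than changing the loop.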

score 0 · Answer 4 · answered Nov 09 '12 at 12:25

Refer to this site to test your regex: http://www.regexplanet.com/advanced/java/index.html

V A S
