-2

I want to remove some divs whose id or class contains the word comment or share (like: <div id="comment">, <div class="header-comment">, <div id="comment-footer">, <div class="social-share">). This is what I use:

preg_replace('/<div[^>]*(comment|share)[^>]*>(.*?)<\/div>/is', '', $htmls);

It does not work. How do I write a correct regex? Here is some test code; I want to remove the comment part and keep the content and footer:

$htmls = <<<EOT
<div id="content">
     Main content.
</div>
<div id="comment">
    <ul>
        <li class="comment">
            <div class="header-comment">
                Comment:
                <span class="date-comment">8/11/2012, 21:25</span>
            </div>
            <h4>Some Text</h4>
            <p class="test-comment">Blah~~ Blah~~ Blah~~</p>
            <div class="share">
                <div class="vote">
                    <a class="vota yes" title="Like">2</a>
                    <a class="vota no" title="Unlike">0</a>
                </div>
            </div>
        </li>
        <li class="comment">
            <div class="header-comment">
                Comment:
                <span class="date-comment">8/11/2012, 23:08</span>
            </div>
            <h4>Other Text</h4>
            <p class="test-comment">Blah~~ Blah~~ Blah~~</p>
            <div class="share">
                <div class="vote">
                    <a class="vota yes" title="Like">4</a>
                    <a class="vota no" title="Unlike">0</a>
                </div>
            </div>
        </li>     
     </ul>
</div>
<div id="footer">
     Footer content.
</div>
EOT;

$htmls = preg_replace('/<div[^>]*(comment|share)[^>]*>(.*?)<\/div>/is', '', $htmls);
echo $htmls;
Peter O.
fish man
  • [Beware of parsing HTML with regular expressions, the way Cthulhu wants you to.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – rid Nov 09 '12 at 12:13
  • A baby seal gets horribly killed every time you try to parse HTML with regex. – moonwave99 Nov 09 '12 at 12:15
  • HTML is not a regular language, and therefore it is really difficult to use regular expressions to parse it. http://en.wikipedia.org/wiki/Regular_language – John Sobolewski Nov 09 '12 at 12:16

4 Answers

2

Consider using the DOMDocument functions to parse the HTML, then target the div you don't want and remove it. This will be faster, easier to understand and maintain, and possibly quicker to write.
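A minimal sketch of that idea, assuming the unwanted div can be identified by a single known id. The sample markup is a hypothetical, shortened version of the question's test code, and serializing only the children of `<body>` is just one way to get a fragment back, not the only one.

```php
<?php
// Hypothetical sample input, shortened from the question's test markup.
$htmls = '<div id="content">Main content.</div>'
       . '<div id="comment"><div class="share">Votes</div></div>'
       . '<div id="footer">Footer content.</div>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
// loadHTML() wraps the fragment in <html><body>.
$dom->loadHTML($htmls);
libxml_clear_errors();

// Target the one div that should go and detach it from its parent.
$target = $dom->getElementById('comment');
if ($target !== null) {
    $target->parentNode->removeChild($target);
}

// Serialize only the children of <body> to get the fragment back.
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result;
```

Because the whole `div#comment` subtree is detached at once, the nested `div.share` goes with it and never needs to be matched separately.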

Martin Lyne
1

I think you should use DOMDocument. Try:

$dom = new DOMDocument();
$dom->loadHTML($htmls);

// Exact id/class values to remove; partial matches such as "header-comment" are not covered.
$remove = array("comment","share");
$removeList = array();

// Collect matches first: removing nodes while iterating the live node list would skip elements.
foreach ( $dom->getElementsByTagName("div") as $div ) {
    if (in_array($div->getAttribute("class"), $remove) || in_array($div->getAttribute("id"), $remove)) {
        $removeList[] = $div;
    }
}

foreach ( $removeList as $div ) {
    $div->parentNode->removeChild($div);
}

$dom->formatOutput = true;
echo "<pre>";
echo htmlentities($dom->saveHTML());
echo "</pre>";
Baba
  • So if there are divs like `div.header-comment` and `div.social-share`, should I list them all in `$remove = array("comment","share","header-comment","social-share");`? It is tiresome to list them all in an array. – fish man Nov 09 '12 at 12:51
  • Is it possible to use `strpos` instead of `in_array`, so that a div is removed if `comment` or `share` is found in its id or class? – fish man Nov 09 '12 at 13:03
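A substring check along the lines asked about in the comments could look roughly like this. It is only a sketch: the sample markup and the `$needles` name are assumptions made for illustration, and `strpos` is case-sensitive (`stripos` would ignore case). Note that plain substring matching is broad, so a class such as `shared-layout` would also be removed.

```php
<?php
// Hypothetical sample input using the id/class variants mentioned in the question.
$htmls = '<div id="content">Main content.</div>'
       . '<div id="comment-footer"><div class="social-share">Votes</div></div>'
       . '<div id="footer">Footer content.</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($htmls);
libxml_clear_errors();

// Words that mark a div for removal when found anywhere in its id or class.
$needles = array('comment', 'share');
$removeList = array();

foreach ($dom->getElementsByTagName('div') as $div) {
    $haystack = $div->getAttribute('id') . ' ' . $div->getAttribute('class');
    foreach ($needles as $needle) {
        if (strpos($haystack, $needle) !== false) {
            $removeList[] = $div;
            break; // one hit is enough for this div
        }
    }
}

foreach ($removeList as $div) {
    // A nested match may already be detached together with its ancestor;
    // removing it from that detached subtree is harmless.
    if ($div->parentNode !== null) {
        $div->parentNode->removeChild($div);
    }
}

// Serialize only the children of <body> to get the fragment back.
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result;
```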
0

How to do a right regex?

You do so by first identifying all DIVs, extracting their text, and then matching that text against your regular expression pattern with preg_match.

However, you can skip the regular expression part as well and just use XPath. That is more straightforward in your case.
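As an illustration of the XPath route, here is a minimal sketch. The sample markup and the exact query are assumptions chosen for this example; the XPath 1.0 `contains()` function does a plain substring test on the `id` and `class` attributes.

```php
<?php
// Hypothetical sample input with matching divs at different nesting levels.
$htmls = '<div id="content">Main content.<div class="social-share">Share</div></div>'
       . '<div id="comment"><div class="header-comment">Comment:</div></div>'
       . '<div id="footer">Footer content.</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($htmls);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select every div whose id or class contains "comment" or "share".
$query = '//div[contains(@id, "comment") or contains(@id, "share")'
       . ' or contains(@class, "comment") or contains(@class, "share")]';

// Copy the result into an array before modifying the tree.
foreach (iterator_to_array($xpath->query($query)) as $div) {
    if ($div->parentNode !== null) {
        $div->parentNode->removeChild($div);
    }
}

// Serialize only the children of <body> to get the fragment back.
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result;
```

The selection logic lives entirely in the query string, so adding another keyword means adding another `contains()` clause rather than another loop.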

hakre
0

Refer to this site to test your regex: http://www.regexplanet.com/advanced/java/index.html

V A S