2

I'm trying to parse some HTML with PHP as an exercise, outputting it as just text, and I've hit a snag. I'd like to remove any tags that are hidden with style="display: none;" - bearing in mind that the tag may contain other attributes and style properties.

The code I have so far is this:

$page = preg_replace("#<([a-z]+).*?style=\".*?display:\s*none[^>]*>.*?</\1>#s","",$page);

The code is returning NULL with a PREG_BACKTRACK_LIMIT_ERROR.
I tried this instead:

$page = preg_replace("#<([a-z]+)[^>]*?style=\"[^\"]*?display:\s*none[^>]*>.*?</\1>#s","",$page);

But now it's just not replacing any tags.

Any help would be much appreciated. Thanks!

Niet the Dark Absol
  • 3
    Just. Don't. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Matt Ball Dec 08 '10 at 22:47
  • possible duplicate of [How to parse and process HTML with PHP?](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php) – PeeHaa Jan 16 '12 at 20:01

2 Answers

2

You should never parse HTML with regex; it will make your eyes bleed. HTML is not a regular language, so it should be parsed with a DOM parser.

Parse HTML to DOM with PHP

Robin Orheden
2

Using DOMDocument, you can try something like this:

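$doc = new DOMDocument;
$doc->loadHTMLFile("foo.html");
// Copy the live node list into an array so that removals don't disturb the iteration
$nodes = iterator_to_array($doc->getElementsByTagName('*'));
foreach ($nodes as $node) {
    // Match "display:none" with any amount of whitespace around the colon
    if (preg_match('/display\s*:\s*none/i', $node->getAttribute('style'))) {
        // A node has to be removed through its own parent, not through the document
        $node->parentNode->removeChild($node);
    }
}
$doc->saveHTMLFile("foo.html");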
karim79
  • Thank you - for actually giving an answer :p – Niet the Dark Absol Dec 08 '10 at 22:58
  • @Kolink - I just edited, made it a bit more robust by incorporating `strpos` to make it work when there are additional style elements present, yet there are still plenty of potential improvements. For example, trimming the attribute using `trim` and also testing for 'display:none' (no space). – karim79 Dec 08 '10 at 23:08
  • I gave you an answer. But just not the whole solution. – Robin Orheden Dec 08 '10 at 23:08
  • 1
    @karim: I used preg_match instead, to handle `display: none` with a variable amount of spaces. I also needed to use `$node->parentNode->removeChild($node)` - it works now. Thanks :) – Niet the Dark Absol Dec 08 '10 at 23:20
  • @Kolink - Can you edit your fixes into my answer? I did not get a chance to test. – karim79 Dec 08 '10 at 23:22
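
For anyone landing here later, the fixes mentioned in the comments above (a whitespace-tolerant match and removal through `parentNode`) could be pulled together roughly like this. It is only a sketch: `visibleText` is a made-up helper name, it works on a string instead of `foo.html`, and it only looks at inline `style` attributes, so anything hidden through a stylesheet or a class is left alone. It also assumes the input is non-empty markup.

```php
<?php
// Hypothetical helper (not part of any answer above): removes elements hidden
// with an inline "display: none" and returns the remaining text.
function visibleText(string $html): string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so take a snapshot before removing nodes.
    $nodes = iterator_to_array($doc->getElementsByTagName('*'));

    foreach ($nodes as $node) {
        // Tolerate any whitespace around the colon and any letter case.
        if (preg_match('/display\s*:\s*none/i', $node->getAttribute('style'))) {
            // Removal has to go through the node's own parent.
            $node->parentNode->removeChild($node);
        }
    }

    // textContent concatenates the text of everything that is left.
    return trim($doc->documentElement->textContent);
}

echo visibleText('<p>Shown <span style="color: red; display:none">hidden</span>text</p>');
```

Taking the snapshot with `iterator_to_array` matters because removing nodes while walking the live list can cause elements to be skipped.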