Regex Remove Images with style tag from Html

Question

I am new to regex, but I decided it was the easiest route to what I needed to do. Basically I have a string (in PHP) which contains a whole load of HTML code... I want to remove any img tags whose style attribute contains display:none...

so for example

<img src="" style="display:none" />

<img src="" style="width:11px;display: none" >

etc...

So far my Regex is:

<img.*style=.*display.*:.*none;.* >

But that seems to leave bits of html behind and also take the next element away when used in php with preg_replace.

score 4 · Answer 1 · answered May 05 '10 at 11:47

$html = preg_replace("/<img[^>]+style[^>]+none[^>]+>/", '', $html);

answered May 05 '10 at 11:47

Anatoly Orlov


thanks works great... no idea how you came up with it but works! – Mark Milford May 05 '10 at 12:04
this will match any IMG elements with any css attribute in style containing the word "none", including `border-style:none;` – Gordon May 05 '10 at 13:03
Gordon: Yes, you're right. It's easy to modify: $html = preg_replace("/<img[^>]+style[^>]*display:\s*none[^>]+>/", '', $html); – Anatoly Orlov May 05 '10 at 13:10
`` – Amarghosh May 05 '10 at 14:02
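
To make Gordon's point about `border-style:none` concrete, here is a minimal sketch comparing the pattern from the answer with a tighter one. The sample markup and the variable names are made up for illustration, and the tighter pattern is only one possible wording. It is still a textual heuristic, not a real HTML parser.

```php
<?php
// Hypothetical sample markup: only the first image is actually hidden.
$html = '<img src="a.png" style="display:none">'
      . '<img src="b.png" style="border-style:none">';

// Pattern from the answer: "style" followed later by "none" inside one tag.
$loose = '/<img[^>]+style[^>]+none[^>]+>/';

// Tighter variant: requires "display", an optional space, ":" and "none".
$strict = '/<img[^>]+style[^>]*display\s*:\s*none[^>]*>/i';

$looseResult  = preg_replace($loose, '', $html);
$strictResult = preg_replace($strict, '', $html);

var_dump($looseResult);  // string(0) "" : both images were removed
var_dump($strictResult); // only the border-style image is left
```

The loose pattern removes the visible image as well, which is the false positive described in the comment.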

score 4 · Accepted Answer · edited May 23 '17 at 12:26

Like Michael pointed out, you don't want to use Regex for this purpose. A Regex does not know what an element tag is. <foo> is as meaningful as >foo< unless you teach it the difference. Teaching the difference is incredibly tedious though.

DOM is so much more convenient:

$html = <<< HTML
<img src="" style="display:none" />
<IMG src="" style="width:11px;display: none" >
<img src="" style="width:11px" >
HTML;

The above is our (invalid) markup. We feed it to DOM like this:

$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->normalizeDocument();

Now we query the DOM for all "IMG" elements containing a "style" attribute that contains the text "display". We could query for "display: none" in the XPath, but our input markup has occurrences with no space in between:

$xpath = new DOMXPath($dom);
foreach($xpath->query('//img[contains(@style, "display")]') as $node) {
    $style = str_replace(' ', '', $node->getAttribute('style'));
    if(strpos($style, 'display:none') !== FALSE) {
        $node->parentNode->removeChild($node);
    }
}

We iterate over the IMG nodes and remove all whitespace from their style attribute content. Then we check if it contains "display:none" and if so, remove the element from the DOM.

Now we only need to save our HTML:

echo $dom->saveHTML();

gives us:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><img src="" style="width:11px"></body></html>

Screw Regex!

Addendum: you might also be interested in Parsing XML documents with CSS selectors

thanks, didn't realize there was a DOM parser built into PHP (although I should have guessed, there is a function for everything else)... your suggestion has worked, even with unusual images... — Mark Milford, May 05 '10 at 15:41
Something to note with the above, after testing for some time it doesn't work if the 'display' is capital... use: [contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), "display")] for the xpath instead — Mark Milford, May 10 '10 at 17:51
@Mark you could also use http://de.php.net/manual/en/domxpath.registerphpfunctions.php and use `strtolower` or `stripos` — Gordon, May 10 '10 at 18:21
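
One way to handle the upper-case problem from the comments is to skip both `translate()` and `registerPhpFunctions()`: select every image that has a style attribute and do the case-insensitive check in PHP. This is only a sketch under assumptions. The sample markup is invented, and the pattern treats `display` as a declaration at the start of the attribute or after a semicolon.

```php
<?php
// Hypothetical sample markup with mixed case and spacing in the style values.
$html = <<<HTML
<img src="a.png" style="DISPLAY: NONE">
<img src="b.png" style="width:11px; display : none">
<img src="c.png" style="width:11px">
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select every <img> with a style attribute; the actual matching happens in PHP.
foreach ($xpath->query('//img[@style]') as $node) {
    // The /i flag makes the check case-insensitive; \s* tolerates spaces.
    if (preg_match('/(^|;)\s*display\s*:\s*none\b/i', $node->getAttribute('style'))) {
        $node->parentNode->removeChild($node);
    }
}

// Count what is left after the removal.
$remaining = $xpath->query('//img');
echo $dom->saveHTML();
```

With this sample only the third image should survive, because the first two both declare `display: none` in some spelling.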

score 0 · Answer 3 · answered May 05 '10 at 11:46

Because <img> doesn't allow any other elements inside it, this is possible; but in general, regexp is a thoroughly bad tool for parsing a recursively defined language like HTML.

Anyway, the problem you're probably hitting is that the closing > is being matched by one of the .* expressions, and there happens to be a later > on the line to match your explicit > .

If you replace all your .* by [^>]* that will prevent that. (They probably don't all need to be replaced, but you might as well).
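
A small sketch of that difference, using the pattern from the question and a made-up single-line input. The input is hypothetical and was chosen so that a later " >" appears on the same line.

```php
<?php
// Hypothetical input: a hidden image, then content that should be kept.
$html = '<img src="" style="display:none;" ><b>keep</b><br >';

// Pattern from the question: every .* is free to run past the closing ">".
$greedy = '/<img.*style=.*display.*:.*none;.* >/';

// Same pattern with each .* replaced by [^>]*, so the match stays in one tag.
$bounded = '/<img[^>]*style=[^>]*display[^>]*:[^>]*none;[^>]* >/';

$greedyResult  = preg_replace($greedy, '', $html);
$boundedResult = preg_replace($bounded, '', $html);

var_dump($greedyResult);  // string(0) "" : the match ran on to the final " >"
var_dump($boundedResult); // string(16) "<b>keep</b><br >"
```

The greedy version swallows everything up to the last " >" on the line, which is the "takes the next element away" symptom from the question.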

score 0 · Answer 4 · edited May 23 '17 at 10:27

Your regular expression is way too broad; .* means "match anything", so this would match:

<img src="foo.png" style="something">Some random displayed text : foo none; bar<br>

At the very least, you probably want to exclude closing brackets from your matches, so [^>]* instead of .*. You also might want to read this, though, and look into using something that actually understands HTML, like DOMDocument

score 0 · Answer 5 · answered Mar 28 '20 at 13:20

Here is another version which works with all tags, whether the inline style is written display:none or display: none. It also deletes the content inside the tags.

$html = preg_replace('/<[^>]+style[^>]+display:\s*none[^>]+>.*?>/', '', $html);

So I have tested it with the following and it works fine.

Only show<div style='display:none'>Delete inside content as well</div> this text.

Only show<span style='display: none'>Delete inside content as well</span> this text.

Only show<div style="display: none">Delete inside content as well</div> this text.

Only show<span style="display:none;">Delete inside content as well</span> this text.

This should now output only:

Only show this text.
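
One caveat worth checking: the trailing `.*?>` stops at the first ">" after the opening tag, so the examples above work because the hidden elements contain plain text only. The sketch below uses an invented input with a nested element to show what is left behind, and compares it with a DOM-based removal along the lines of the accepted answer. The `//*[@style]` query and the variable names are assumptions made for illustration.

```php
<?php
// Hypothetical input: the hidden <div> contains a nested <b> element.
$html = 'Only show<div style="display:none">Delete <b>inside</b> content</div> this text.';

// The regex from this answer consumes only up to the first ">" after the tag.
$regexResult = preg_replace('/<[^>]+style[^>]+display:\s*none[^>]+>.*?>/', '', $html);
var_dump($regexResult); // "Only showinside</b> content</div> this text."

// DOM-based alternative: removing an element also removes everything inside it.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//*[@style]') as $node) {
    // Normalise case and strip spaces before looking for the declaration.
    $style = str_replace(' ', '', strtolower($node->getAttribute('style')));
    if (strpos($style, 'display:none') !== false && $node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

$domResult = $dom->saveHTML();
echo $domResult;
```

Note that `saveHTML()` wraps the fragment in a full document, so the surrounding html and body tags would need to be stripped if only the fragment is wanted.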
