I want to create an output text filter that replaces all the <img> elements in the DOM with the text "no images allowed".

For example, if the user creates this HTML markup:

<p><img src="/image.jpg" /></p>

the following HTML is rendered:

<p>no images allowed</p>

Please note that I cannot use preg_replace. The question is simplified, and I need to parse the DOM to find which images to disallow.

Thanks to this answer, I found that getElementsByTagName() returns a "live" iterator, so two steps are needed. This is what I have:

foreach ($elements as $element) {
  $domArray[] = $element;
  $src= $element->getAttribute('src');
  $frag= $dom->createElement('p');
  $frag->nodeValue = 'no images allowed';
  $element->parentNode->appendChild($frag);
}
// loop through the array and delete each node
$nodes = iterator_to_array($dom->getElementsByTagName('img'));
foreach ($nodes as $node) {
  $node->parentNode->removeChild($node);
}
$newtext = $dom->saveHTML();

It almost does what I want, but I get this:

<p><p>no images allowed</p></p>
Free Radical
  • What happens if the user creates a `<p>` element with the `<img>` AND some text or other elements inside? – guido Oct 14 '17 at 13:50
  • @GUIDO, creating a `<p>` element works as one would expect (see updated question). It is getting rid of the `<img>` element that is the problem. – Free Radical Oct 14 '17 at 14:08

2 Answers

To replace a self-closing HTML img tag, you can use a simple regular expression:

<?php

function no_images_allowed($text) {
    return preg_replace('/<img[^>]*>/', 'no images allowed', $text);
}

print no_images_allowed('<p><img src="/image.jpg" /></p>');

It is simpler and should be much more efficient: you do not need to traverse every DOM element, you just process plain text.

The regex in the example above only works for self-closing img tags:

<img src="..."/>
<img src="...">

Please note that it will not work with, for example:

<img src="..."></img>
<IMG SRC="..."/>
<img src="...">invalid content</img>

If you want to cover every possible case (even invalid ones), the proposed regex would need to be modified.
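
One possible modification, as a rough sketch only: adding the `i` modifier handles upper-case tags, and an optional trailing group consumes a closing `</img>` that immediately follows the opening tag. The function name `no_images_allowed_ci` is made up for this illustration. It still does not handle content between `<img>` and `</img>`, or a `>` inside an attribute value.

```php
<?php

// Hypothetical variant of no_images_allowed(); not a general HTML parser.
function no_images_allowed_ci($text) {
    // <img\b          : the tag name, followed by a word boundary
    // [^>]*>          : everything up to the first ">" (breaks on ">" inside attributes)
    // (?:\s*<\/img>)? : optionally, a closing tag directly after the opening tag
    // i               : case-insensitive, so <IMG ...> is matched as well
    return preg_replace('/<img\b[^>]*>(?:\s*<\/img>)?/i', 'no images allowed', $text);
}

print no_images_allowed_ci('<p><IMG SRC="/image.jpg"/></p>');
```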

Paweł Tatarczuk
  • That's an approach that I would have suggested as well. – iquellis Oct 14 '17 at 14:17
  • Have you ever seen this: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ? What about `>`? – guido Oct 14 '17 at 14:29
  • I agree that regex queries cannot replace real DOM parsing, but I don't think it is necessary to parse the whole DOM only to remove a self-closing img tag. It all depends on what you need. I would not recommend regex for more complex DOM modifications. – Paweł Tatarczuk Oct 14 '17 at 14:42
  • To replace *every* image tag, preg_replace would work great. However, I really want to do a conditional replacement depending on the value of the attributes, so I need to parse the DOM. – Free Radical Oct 14 '17 at 14:49
  • As an aside, I filtered this page's source with both the preg_replace and the DOMXPath solutions: the first took 0.00027ms, the latter 0.00261ms, so preg_replace is ~10 times faster. The downside of regex is that it won't work with invalid HTML or in more complex cases. – Paweł Tatarczuk Oct 14 '17 at 15:01

I would fetch the elements with XPath, then replace each one with a newly created text node.

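// $dom is the DOMDocument that already holds the parsed markup.
$xp = new DOMXPath($dom);
// Unlike getElementsByTagName(), an XPath query returns a static node list,
// so nodes can be removed while iterating over it.
$elements = $xp->query('//img');
foreach ($elements as $element) {
  // A text node (not a new <p> element) avoids the nested <p><p>…</p></p>.
  $text = $dom->createTextNode('no images allowed');
  $element->parentNode->insertBefore($text, $element);
  $element->parentNode->removeChild($element);
}
echo $dom->saveHTML();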

Demo here: http://codepad.org/w9uj0ez9
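
Since the question asks for a conditional replacement based on attribute values, the same idea can be sketched as a self-contained filter. This is only an illustration under some assumptions: the function name `replace_disallowed_images`, the `$isDisallowed` callback, and the wrapping `<div>` are invented here, the check looks at the `src` attribute only, and character encoding of the input is not handled.

```php
<?php

// Hypothetical helper: replaces each <img> whose src is rejected by $isDisallowed.
function replace_disallowed_images(string $html, callable $isDisallowed): string
{
    // Collect libxml warnings about sloppy markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Wrap the fragment in a <div> so it has a single root element, and
    // suppress the implied <html>/<body> wrappers and the doctype.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//img') as $img) {
        if ($isDisallowed($img->getAttribute('src'))) {
            // replaceChild(new, old) swaps the image for a plain text node.
            $text = $dom->createTextNode('no images allowed');
            $img->parentNode->replaceChild($text, $img);
        }
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Usage: disallow one specific image and leave the other untouched.
echo replace_disallowed_images(
    '<p><img src="/image.jpg" /></p><p><img src="/ok.png" /></p>',
    function (string $src): bool { return $src === '/image.jpg'; }
);
```

Passing the decision in as a callback keeps the DOM handling separate from whatever rule decides which images to disallow.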

guido