Do not use regex to parse valid HTML. Use regex to parse an HTML document ONLY if all available DOM parsers are failing you. I super-love regex, but regex is "DOM-ignorant": it will quietly fail and/or mutate your document.
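To make the "quietly mutate" risk concrete, here is a small contrived sketch. The pattern is a hypothetical regex attempt at "remove every attribute except src" (it is not taken from any real codebase), and the input is invented so that the paragraph text happens to contain something that looks like an attribute.

```php
<?php
// Contrived input: the text content contains an attribute-like substring.
$html = '<p title="tip">Type class="green" to style it</p>';

// Hypothetical regex attempt at "remove every attribute except src".
$pattern = '/\s+(?!src=)[\w-]+="[^"]*"/';

// The regex cannot tell markup from text, so it strips both matches.
$mutated = preg_replace($pattern, '', $html);

echo $mutated; // <p>Type to style it</p>
```

The title attribute is removed as intended, but so is class="green", which was visible text rather than an attribute. No error or warning is raised.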
I generally prefer a mix of DOMDocument and XPath to concisely, directly, and intuitively target document entities.
With only a couple of minor exceptions, the XPath expression closely resembles its logic in plain English.
//@*[not(name()="src")]
- at any level in the document (//)
- find any attribute (@*)
- satisfying these requirements ([])
- that is not (not())
- named "src" (name()="src")
This is far more readable, attractive, and maintainable than an equivalent regex pattern.
Code: (Demo)
$html = <<<HTML
<p id="paragraph" class="green">
This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/>
</p>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//@*[not(name()="src")]') as $attr) {
    // Remove each matched attribute from the element that carries it
    $attr->parentNode->removeAttribute($attr->nodeName);
}
echo $dom->saveHTML();
Output:
<p>
This is a paragraph with an image <img src="/path/to/image.jpg">
</p>
If you want to exempt another attribute, add it to the condition with or:
//@*[not(name()="src" or name()="href")]
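If the list of exempt attributes changes often, the expression can be built from an array instead of edited by hand. The following is only a sketch: stripAttributesExcept() is a hypothetical helper name, the name check is a deliberately conservative assumption about what an attribute name may look like, and ownerElement is used in place of parentNode to reach the element that holds the attribute.

```php
<?php
// Hypothetical helper (not part of the original answer): removes every
// attribute whose name is not listed in $keep.
function stripAttributesExcept(string $html, array $keep): string
{
    $tests = [];
    foreach ($keep as $name) {
        // Conservative check so a name cannot break out of the XPath string literal.
        if (!preg_match('/^[A-Za-z_][\w.:-]*$/', $name)) {
            throw new InvalidArgumentException("Invalid attribute name: $name");
        }
        $tests[] = 'name()="' . $name . '"';
    }

    // With an empty list there is nothing to exempt, so target every attribute.
    $expr = $tests ? '//@*[not(' . implode(' or ', $tests) . ')]' : '//@*';

    $dom = new DOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query($expr) as $attr) {
        // ownerElement is the element node that carries this attribute.
        $attr->ownerElement->removeAttribute($attr->nodeName);
    }

    return $dom->saveHTML();
}

// Usage: keep only src and href.
echo stripAttributesExcept('<p id="x"><a href="/a" class="b">link</a></p>', ['src', 'href']);
```

For the sample call, the href attribute should survive while id and class are removed.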