Strip all HTML attributes except for src

Question

I'm trying to remove all tag attributes except for the src attribute. For example:

<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>

Would be returned as:

<p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p>

I have a regular expression to strip all attributes, but I'm trying to tweak it to leave in src. Here's what I have so far:

<?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>');

You can parse HTML using regular expressions. Not all HTML. But if you know exactly what you're receiving you can use regular expressions. This is a religious war started by people who assume that infinite stacks and memory are available in all situations. — PP., Jun 08 '10 at 08:32

gnarf · Answer 1 · 2010-06-08T22:14:52.133

This might work for your needs:

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text);

// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

The RegExp broken down:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]         # Match 'a' through 'z'
  [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 (?:           # Start Non-Capture Group
  [^>]*         # Match anything other than '>', Zero or More Times
  (             # Start Capture Group $2 - ' src="...."'
   \s            # Match one whitespace
   src=          # Match 'src='
   ['"]          # Match ' or "
   [^'"]*        # Match anything other than ' or " 
   ['"]          # Match ' or "
  )             # End Capture Group 2
 )?            # End Non-Capture Group, match group zero or one time
 [^>]*?        # Match anything other than '>', Zero or More times, not-greedy (wont eat the /)
 (\/?)         # Capture Group $3 - '/' if it is there
 >             # Match '>'
/i            # End Pattern - Case Insensitive

Add some quoting and use the replacement text <$1$2$3>, and it should strip any non-src attributes from well-formed HTML tags.

Please note: this isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp people are so cleverly noting below. There are a few pitfalls, most notably <p style=">"> would end up as <p>">, along with a few other broken cases. I would recommend looking at Zend_Filter_StripTags as a foolproof tags/attributes filter in PHP.
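To see the limitation described above in practice, here is a small sketch that runs the same pattern against the well-formed example and against the <p style=">"> case. The variable names ($pattern, $clean, $broken) are only for illustration.

```php
<?php
// Same pattern as above, written in a single-quoted string so that
// only the single quotes inside the character classes need escaping.
$pattern = '/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=[\'"][^\'"]*[\'"]))?[^>]*?(\/?)>/i';

// Well-formed input: every attribute except src is dropped.
$clean = preg_replace(
    $pattern,
    '<$1$2$3>',
    '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>'
);
echo $clean, "\n";

// Input with '>' inside an attribute value: the match stops at the first '>',
// so the rest of the attribute value leaks into the text.
$broken = preg_replace($pattern, '<$1$2$3>', '<p style=">">text</p>');
echo $broken, "\n"; // <p>">text</p>
```

The second call is the reason the answer limits its claim to well-formed tags whose attribute values never contain '>'.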

Unless `>` appears in an attribute value. Parsing evil HTML is _hard_. Plus, you forgot to escape `\ `. — SLaks, Jun 08 '10 at 22:09
@gnarf can you please explain how I should modify your regular expression if I need to keep more than one attribute (e.g. `src` and `height`)? My scenario is exactly like this [Issue](http://stackoverflow.com/questions/36494743/remove-unnecessary-attributes-from-html-tag-using-javascript-regex?noredirect=1#comment60600396_36494743) — Qazi, Apr 13 '16 at 12:10
@qazi - Use a html parser or manipulator.... Regexp is not suited for the task, as src and height can appear in whatever order, and many other reasons you shouldn't use regular expressions to parse html — gnarf, Apr 18 '16 at 14:45
@gnarf I want to ignore `href` + `src`. Can you please guide me on this? — Muhammad Hassaan, Jan 23 '19 at 04:22
I would definitely not enjoy maintaining this regex (and I love regex). — mickmackusa, Jan 15 '21 at 22:55

score 8 · Answer 2 · edited May 23 '17 at 12:02

You usually should not parse HTML using regular expressions.

Instead, you should call DOMDocument::loadHTML.
You can then recurse through the elements in the document and call removeAttribute.
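A minimal sketch of that approach, assuming the input is a fragment with a single root element. The function name stripAttributes and its $keep parameter are hypothetical; only DOMDocument::loadHTML, removeAttribute and saveHTML come from the DOM extension itself.

```php
<?php
// Hypothetical helper: recursively removes every attribute whose name
// is not listed in $keep.
function stripAttributes(DOMNode $node, array $keep = ['src']): void
{
    if ($node instanceof DOMElement) {
        // Collect the names first; removing attributes while iterating
        // the live attribute map can skip entries.
        $names = [];
        foreach ($node->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array(strtolower($name), $keep, true)) {
                $node->removeAttribute($name);
            }
        }
    }

    // Recurse into child nodes, if any (text nodes have none).
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            stripAttributes($child, $keep);
        }
    }
}

$html = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

$dom = new DOMDocument();
// These flags keep libxml from wrapping the fragment in <html><body> and a doctype.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

stripAttributes($dom);

$result = trim($dom->saveHTML());
echo $result, "\n";
```

Passing a different list, for example stripAttributes($dom, ['src', 'href']), would retain more than one attribute without changing the traversal.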

answered Jun 08 '10 at 02:34 by SLaks

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. – fmark Jun 08 '10 at 04:25
Some people have a terrible habit of not answering the question and instead obsessing about mantras. This should have been downvoted, not upvoted by the religious right. – PP. Jun 08 '10 at 08:33
Some people, when confronted with a problem, think "I know, I'll quote Jamie Zawinski." Now they have two problems. This really is the kind of problem that is best handled by a dedicated markup parser/processor, that's quite true. But regular expressions are a damn fine tool for many jobs, including some markup processing tasks, and it's foolish to outright dismiss them. – Weston C Jun 08 '10 at 21:51
I'm gonna have to agree with PP. Downvoted because of the dogmatic answer given. It IS possible to parse HTML with regular expressions, especially if you know exactly what you're going for. DOMDocument is great in some cases, but not all. – Ian McIntyre Silber Jun 08 '10 at 22:30
@SLaks While I agree with the sentiment, this answer isn't very generous. Perhaps you'd like to improve your answer by adding a conditional expression in the nested loop of https://stackoverflow.com/a/65741427/2943403. Or maybe you could allow my answer to carry the torch for you and give it a shorter path to the top of the page. I make this appeal to you because clearly rep is of no value to you any longer. – mickmackusa Jan 15 '21 at 22:54

Ian McIntyre Silber · Accepted Answer · 2010-06-11T17:51:38.777

Alright, here's what I used that seems to be working well:

<([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>

Feel free to poke any holes in it.

answered Jun 08 '10 at 21:32 · edited Jun 11 '10 at 17:51

I don't want to waste my time poking holes, though regex is notoriously easy to poke holes into when parsing an html document. I hope that you will consider accepting my clean, professional, and robust answer. ...not because I care about fake points but because I want researchers to find the best solution on this old page. – mickmackusa Jan 15 '21 at 22:32

score 0 · Answer 4 · answered Jun 08 '10 at 08:40

Unfortunately I'm not sure how to answer this question for PHP. If I were using Perl I would do the following:

use strict;
my $data = q^<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>^;

$data =~ s{
    <([^/> ]+)([^>]+)> # split into tagtype, attribs
}{
    my $attribs = $2;
    my @parts = split( /\s+/, $attribs ); # separate by whitespace
    @parts = grep { m/^src=/i } @parts;   # retain just the src attributes
    if ( @parts ) {
        "<" . join( " ", $1, @parts ) . ">";
    } else {
        "<" . $1 . ">";
    }
}xseg;

print( $data );

which returns

<p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>

mickmackusa · Answer 5 · 2021-01-15T22:41:29.363

Do not use regex to parse valid html. Use regex to parse an html document ONLY if all available DOM parsers are failing you. I super-love regex, but regex is "DOM-ignorant" and it will quietly fail and/or mutate your document.

I generally prefer a mix of DOMDocument and XPath to concisely, directly, and intuitively target document entities.

With only a couple of minor exceptions, the XPath expression closely resembles its logic in plain English.

//@*[not(name()="src")]

at any level in the document (//)
find any attribute (@*)
satisfying these requirements ([])
that is not (not())
named "src" (name()="src")

This is far more readable, attractive, and maintainable.

Code:

$html = <<<HTML
<p id="paragraph" class="green">
    This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/>
</p>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//@*[not(name()="src")]') as $attr) {
    $attr->parentNode->removeAttribute($attr->nodeName);
}
echo $dom->saveHTML();

Output:

<p>
    This is a paragraph with an image <img src="/path/to/image.jpg">
</p>

If you want to exempt another attribute, you can combine the conditions with or:

//@*[not(name()="src" or name()="href")]
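For a longer list of exempt attributes, the expression could be generated rather than typed by hand. The sketch below assumes a hypothetical helper, buildKeepExpression, and a made-up sample document; the attribute names passed in are assumed to be plain names without quotes.

```php
<?php
// Hypothetical helper: builds an XPath expression that selects every
// attribute whose name is NOT in $keep.
function buildKeepExpression(array $keep): string
{
    $tests = array_map(function ($name) {
        return sprintf('name()="%s"', $name);
    }, $keep);

    return '//@*[not(' . implode(' or ', $tests) . ')]';
}

// Sample markup made up for this illustration.
$html = '<p id="intro"><a href="/page" class="link" title="x"><img src="/path/to/image.jpg" width="50"></a></p>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Produces the same expression shown above for src and href.
$expression = buildKeepExpression(['src', 'href']);

foreach ($xpath->query($expression) as $attr) {
    $attr->parentNode->removeAttribute($attr->nodeName);
}

$result = trim($dom->saveHTML());
echo $expression, "\n", $result, "\n";
```

Keeping the allowlist in an array means the removal loop stays the same however many attributes are exempted.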

@Hassaan See the bottom of my answer for the expression to retain `src` and `href` attributes. — mickmackusa, Jan 15 '21 at 22:57

score -1 · Answer 6 · answered Jun 08 '10 at 22:28

As noted above, you shouldn't use regex to parse HTML or XML.

If the input is always the same, I would handle your example with str_replace().

$str = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

// Include the leading space in each search string so no stray space is left inside the tag.
$str = str_replace(' id="paragraph" class="green"', '', $str);

$str = str_replace(' width="50" height="75"', '', $str);
