I'm new to regular expressions and am trying to filter HTML tags so that only the required attributes (src / href / style) and their values are kept and all other attributes are removed. While googling I found a regular expression that keeps only the "src" attribute, and I modified it as follows:

<([a-z][a-z0-9]*)(?:[^>]*(\s(src|href|style)=['\"][^'\"]*['\"]))?[^>]*?(\/?)>

It works, but there is one problem: if a tag contains more than one required attribute, only the last matched attribute is kept and the rest are discarded.
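
A minimal sketch of that behaviour, assuming the expression is applied with JavaScript's `String.prototype.replace` (the variable names are illustrative only). The greedy `[^>]*` first consumes the whole tag and then backtracks just far enough for the optional attribute group to match once, which happens at the last allowed attribute:

```javascript
// The expression from above, unchanged, with the g and i flags.
const pattern = /<([a-z][a-z0-9]*)(?:[^>]*(\s(src|href|style)=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/gi;

// A tag with two allowed attributes.
const tag = '<a style="margin:0px;" href="http://www.germany.travel/">';

// Group 2 captures a single attribute, so only one can survive the substitution.
const result = tag.replace(pattern, '<$1$2$4>');
console.log(result); // <a href="http://www.germany.travel/">
```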

I'm trying to clean the following text

<title>Hello World</title>
<div fadeout"="" style="margin:0px;" class="xyz">
    <img src="abc.jpg" alt="" />
    <p style="margin-bottom:10px;">
        The event is celebrating its 50th anniversary K&ouml;&nbsp;
        <a style="margin:0px;" href="http://www.germany.travel/">exhibition grounds in Cologne</a>.
    </p>
    <p style="padding:0px;"></p>
    <p style="color:black;">
        <strong>A festival for art lovers</strong>
    </p>
</div>

at https://regex101.com/#javascript, using the aforementioned expression with <$1$2$4> as the substitution string, and I get the following output:

<title>Hello World</title>
<div style="margin:0px;">
    <img src="abc.jpg"/>
    <p style="margin-bottom:10px;">
        The event is celebrating its 50th anniversary K&ouml;&nbsp;
        <a href="http://www.germany.travel/">exhibition grounds in Cologne</a>.
    </p>
    <p style="padding:0px;"></p>
    <p style="color:black;">
        <strong>A festival for art lovers</strong>
    </p>
</div>

The problem is that the "style" attribute is discarded from the anchor tag. I have tried to repeat the (\s(src|href|style)=['\"][^'\"]*['\"]) block using the * operator, the {3} quantifier and more, but in vain. Any suggestions?

Ketan
Ahmad Ahsan
  • I can suggest using RegexBuddy for testing expressions. It saved me a lot of time in the past. https://www.regexbuddy.com/ – Bozidar Sikanjic Apr 08 '16 at 08:24
  • For reference, OP's code can be found at https://regex101.com/r/mP0pX6/1 – Adrian Wragg Apr 08 '16 at 08:25
  • Why don't you use DOM manipulation instead of RegEx? – Salman A Apr 08 '16 at 09:40
  • @SalmanA I'm trying to do the same using DOM manipulation, but it fails with jQuery 1.9.1. jQuery 2.0.0 fixes the issue, but the other libraries in my application are not compatible with it. Here is my fiddle test link: https://jsfiddle.net/vytu9duc/5/ I get the following error in the console: Uncaught InvalidCharacterError: Failed to execute 'setAttribute' on 'Element': 'fadeout"' is not a valid attribute name. Any suggestions? – Ahmad Ahsan Apr 11 '16 at 14:40
  • Related: [Regex to remove HTML attribute from any HTML tag?](https://stackoverflow.com/q/7529068/104380) – vsync Nov 10 '20 at 20:48

2 Answers

@AhmadAhsan here is a demo that fixes your issue using DOM manipulation: https://jsfiddle.net/pu1hsdgn/

    <script src="https://code.jquery.com/jquery-1.9.1.js"></script>
    <script>
        // Attributes to keep; every other attribute is removed.
        var whitelist = ["src", "href", "style"];
        $( document ).ready(function() {
            function foo(contents) {
                // Load the markup into a detached container element.
                var temp = $(document.createElement('div')).html(contents);

                temp.find('*').each(function () {
                    var attributes = this.attributes;
                    var i = attributes.length;
                    // Walk backwards because the collection shrinks as nodes are removed.
                    while( i-- ) {
                        var attr = attributes[i];
                        // Remove by node rather than by name.
                        if( $.inArray(attr.name, whitelist) == -1 )
                            this.removeAttributeNode(attr);
                    }
                });
                return temp.html();
            }
            var raw = '<title>Hello World</title><div style="margin:0px;" fadeout"="" class="xyz"><img src="abc.jpg" alt="" /><p style="margin-bottom:10px;">The event is celebrating its 50th anniversary K&ouml;&nbsp;<a href="http://www.germany.travel/" style="margin:0px;">exhibition grounds in Cologne</a>.</p><p style="padding:0px;"></p><p style="color:black;"><strong>A festival for art lovers</strong></p></div>';
            alert(foo(raw));
        });
    </script>

Here you go, based on your original regex:

<([a-z][a-z0-9]*?)(?:[^>]*?((?:\s(?:src|href|style)=['\"][^'\"]*['\"]){0,3}))[^>]*?(\/?)>

Group 1 is the tag name, group 2 is the run of allowed attributes, and group 3 is the / if there is one. I couldn't get it to work when non-allowed attributes are interleaved with allowed ones, e.g. <a href="foo" class="bar" src="baz" />. I don't think it can be done with a single substitution pattern.
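
The interleaved case looks more tractable if the work is split into two steps instead of one substitution pattern. The following is only a sketch of that different technique, not the regex above: one pattern finds each opening tag, and a replacement callback filters the attributes inside it. The names `ALLOWED` and `keepAllowedAttributes` are hypothetical, and it assumes quoted attribute values that contain no `>` character.

```javascript
// Hypothetical helper, not part of the original answer.
const ALLOWED = ['src', 'href', 'style'];

function keepAllowedAttributes(html) {
  // Step 1: match each opening or self-closing tag as a whole.
  // Closing tags such as </div> do not match and are left untouched.
  const tagPattern = /<([a-z][a-z0-9]*)([^>]*?)(\/?)>/gi;
  // Step 2: inside one tag, find every quoted name="value" pair.
  const attrPattern = /\s([a-z][a-z0-9-]*)=(["'])(.*?)\2/gi;

  return html.replace(tagPattern, (match, name, attrs, slash) => {
    let kept = '';
    for (const m of attrs.matchAll(attrPattern)) {
      // Keep the pair only if its name is on the allow list.
      if (ALLOWED.includes(m[1].toLowerCase())) kept += m[0];
    }
    return `<${name}${kept}${slash}>`;
  });
}

console.log(keepAllowedAttributes('<a href="foo" class="bar" src="baz" />'));
// <a href="foo" src="baz"/>
```

Because each attribute is tested on its own, the order and number of allowed attributes no longer matter. Malformed fragments such as `fadeout"=""` simply fail to match the attribute pattern and are dropped.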

Edit: Per @AhmadAhsan's corrections below the regex should be:

var html = `<div fadeout"="" style="margin:0px;" class="xyz">
                <img src="abc.jpg" alt="" />
                <p style="margin-bottom:10px;">
                    The event is celebrating its 50th anniversary K&ouml;&nbsp;
                    <a style="margin:0px;" href="http://www.germany.travel/">exhibition grounds in Cologne</a>.
                </p>
                <p style="padding:0px;"></p>
                <p style="color:black;">
                    <strong>A festival for art lovers</strong>
                </p>
            </div>`


console.log(
  html.replace(/<([a-z][a-z0-9]*)(?:[^>]*?((?:\s(?:src|href|style)=['\"][^'\"]*['\"]){0,3}))[^>]*?(\/?)>/gi, '<$1$2$3>')
)
    
vsync
Joels Elf
  • The quantifier on the tag name should be the greedy '*' rather than the lazy '*?'; otherwise it returns only 't' instead of 'title'. I used the following with the substitution string <$1$2$3>: <([a-z][a-z0-9]*)(?:[^>]*?((?:\s(?:src|href|style)=['\"][^'\"]*['\"]){0,3}))[^>]*?(\/?)> It does not fulfil my requirement, but it might be helpful for someone else. – Ahmad Ahsan Apr 14 '16 at 06:26
  • @AhmadAhsan You're right. I only tested it on an `a` tag. – Joels Elf Apr 15 '16 at 19:46