I still can't write regular expressions from memory, so I could not find a final solution for this: using RegEx in JavaScript, strip all styles from <p style="">...</p>, but keep color and background-color if they exist.

What I found:

1. Remove the complete style="..." attribute with RegEx (replacing with '$1' keeps the start of the tag; an empty replacement would delete it too):

htmlString = (htmlString).replace(/(<[^>]+) style=".*?"/i, '$1');


2. Remove certain styles with RegEx:

htmlString = (htmlString).replace(/font-family\:[^;]+;?|font-size\:[^;]+;?|line-height\:[^;]+;?/g, '');


Challenge: if removing the styles leaves the attribute empty (no color exists, so we end up with style="" or style=" "), the style attribute itself should be removed as well.

I guess we need two lines of code?

Any help appreciated!


Example 1 (whitelisted "color" survives):

<p style="font-family:Garamond;font-size:8px;line-height:14px;color:#FF0000;">example</p>

should become:

<p style="color:#FF0000;">example</p>


Example 2 (all styles die):

<p style="font-family:Garamond;font-size:8px;line-height:14px;">example</p>

should become:

<p>example</p>
Avatar
  • Don't parse or modify HTML with Regex. It's not going to end well. – g.d.d.c Sep 13 '12 at 18:20
  • I know about this discussion thank you :) For my case it is fine to use RegEx. – Avatar Sep 13 '12 at 18:23
  • It isn't, it's never _fine_, it is _doable_ at best. [unless you are a devil worshipping, virgin-blood drinking wonderpony](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), mend your ways... please – Elias Van Ootegem Sep 13 '12 at 18:59
  • "No matter how many times we say it, they won't stop coming every day..." +1 --- However, my example above can be used for other scenarios (XML), and must not be bound to HTML ;) – Avatar Sep 13 '12 at 19:06
  • I'm assuming that you don't know the order in which the style attributes will appear beforehand, right? Because really, in that case you won't get a satisfactory solution with regex. – Tim Pietzcker Sep 13 '12 at 19:49
  • @TimPietzcker - It's possible. I have a solution that I'll post in a bit. – Andrew Cheong Sep 13 '12 at 20:35
  • @acheong87: Can't wait to see it. (I do also think it's possible, but I fear it's not going to be pretty. And if your whitelist grows in size, your regex will probably have to grow exponentially). – Tim Pietzcker Sep 13 '12 at 21:23
  • @TimPietzcker - You are right about it being ugly! The good news is that it grows linearly ;) Solution has been posted. – Andrew Cheong Sep 13 '12 at 21:28

2 Answers

First, the proof of concept. Check out the Rubular demo.

The regex goes like this:

/(<[^>]+\s+)(?:style\s*=\s*"(?!(?:|[^"]*[;\s])color\s*:[^";]*)(?!(?:|[^"]*[;\s])background-color\s*:[^";]*)[^"]*"|(style\s*=\s*")(?=(?:|[^"]*[;\s])(color\s*:[^";]*))?(?=(?:|[^"]*)(;))?(?=(?:|[^"]*[;\s])(background-color\s*:[^";]*))?[^"]*("))/i

Broken down, it means:

(<[^>]+\s+)                           Capture start tag to style attr ($1).

(?:                                   CASE 1:

    style\s*=\s*"                     Match style attribute.

    (?!                               Negative lookahead assertion, meaning:
        (?:|[^"]*[;\s])               If color found, go to CASE 2.
        color\s*:[^";]*
    )

    (?!
        (?:|[^"]*[;\s])               Negative lookahead assertion, meaning:
        background-color\s*:[^";]*    If background-color found, go to CASE 2.
    )

    [^"]*"                            Match the rest of the attribute.

|                                     CASE 2:

    (style\s*=\s*")                   Capture style attribute ($2).

    (?=                               Positive lookahead.
        (?:|[^"]*[;\s])
        (color\s*:[^";]*)             Capture color style ($3),
    )?                                if it exists.

    (?=                               Positive lookahead.
        (?:|[^"]*)                    
        (;)                           Capture semicolon ($4),
    )?                                if it exists.

    (?=                               Positive lookahead.
        (?:|[^"]*[;\s])
        (background-color\s*:[^";]*)  Capture background-color style ($5),
    )?                                if it exists.

    [^"]*(")                          Match the rest of the attribute,
                                      capturing the end-quote ($6).
)

Now, the replacement,

\1\2\3\4\5\6

should always construct what you expect to have left!

The trick here, in case it's not clear, is to put the "negative" case first, so that only if the negative case fails, the captures (such as the style attribute itself) would be populated, by, of course, the alternate case. Otherwise, the captures default to nothing, so not even the style attribute will show up.
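A tiny illustration of that trick, using a made-up pattern rather than the real one (`pattern` and `rebuild` are names invented for this sketch). When the first alternative matches, the capture group in the second alternative never participates, and in a JavaScript replacement callback it arrives as `undefined`, so it has to be defaulted before concatenating:

```javascript
// Made-up pattern for illustration only:
// "x" plays the negative case, "(y)" the capturing case.
const pattern = /(a)(?:x|(y))/;

function rebuild(input) {
    return input.replace(pattern, function (match, $1, $2) {
        // $2 is undefined when the first alternative matched,
        // so default it to an empty string before concatenating.
        return $1 + ($2 ? $2 : '');
    });
}

rebuild('ax'); // 'a'  (first alternative won, $2 stayed empty)
rebuild('ay'); // 'ay' (second alternative won, $2 is 'y')
```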

To do this in JavaScript, do this:

htmlString = htmlString.replace(

    /(<[^>]+\s+)(?:style\s*=\s*"(?!(?:|[^"]*[;\s])color\s*:[^";]*)(?!(?:|[^"]*[;\s])background-color\s*:[^";]*)[^"]*"|(style\s*=\s*")(?=(?:|[^"]*[;\s])(color\s*:[^";]*))?(?=(?:|[^"]*)(;))?(?=(?:|[^"]*[;\s])(background-color\s*:[^";]*))?[^"]*("))/gi,

    function (match, $1, $2, $3, $4, $5, $6, offset, string) {
        return $1 + ($2 ? $2       : '') + ($3 ? $3 + ';' : '')
                  + ($5 ? $5 + ';' : '') + ($2 ? $6       : '');
    }

);

Note that I'm doing this for fun, not because this is how this problem should be solved. Also, I'm aware that the semicolon-capture is hacky, but it's one way of doing it. And one can infer how to extend the whitelist of styles, looking at the breakdown above.
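For comparison, the same whitelist can be applied in two smaller steps, which is closer to the "two lines of code" the question guesses at. This is only a sketch and not the regex above: `keepWhitelistedStyles` and `WHITELIST` are made-up names, and it assumes double-quoted style attributes whose values contain no semicolons or quotes. It also avoids the optional lookaheads `(?=...)?`, whose captures standard-conforming JavaScript engines appear to discard, which is a likely reason the single regex behaves differently across browsers.

```javascript
// Hypothetical helper, not part of the answer's regex.
// Assumes style="..." is double-quoted and values contain no ';' or '"'.
const WHITELIST = ['color', 'background-color'];

function keepWhitelistedStyles(htmlString) {
    // Step 1: find every style attribute inside a start tag.
    return htmlString.replace(
        /(<[^>]*?)\s+style\s*=\s*"([^"]*)"/gi,
        function (match, tagStart, styleValue) {
            // Step 2: split into declarations, keep whitelisted properties.
            const kept = styleValue
                .split(';')
                .map(function (decl) { return decl.trim(); })
                .filter(function (decl) {
                    const name = decl.split(':')[0].trim().toLowerCase();
                    return decl.indexOf(':') !== -1 &&
                           WHITELIST.indexOf(name) !== -1;
                });
            // Drop the attribute entirely when nothing survives.
            return kept.length
                ? tagStart + ' style="' + kept.join(';') + ';"'
                : tagStart;
        }
    );
}
```

Extending the whitelist then only means adding another property name to the array.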

Andrew Cheong
  • +1. Yikes. OK, so technically, it isn't a pure regex-replace because you need additional program logic to pick the replacements, but kudos for sheer regex bravery! – Tim Pietzcker Sep 13 '12 at 21:43
  • @TimPietzcker - Well, it's only JavaScript, that (1) to use back-references one needs to invoke a function, and (2) uncaptured groups don't default to empty strings, so you can't just concatenate them all. But yes, I can see how my claim may have misled! – Andrew Cheong Sep 13 '12 at 23:26
  • Very good answer, and great that you also explained the parts! I used your **javascript code** on both examples from the question, however, the first example becomes: _example_ instead of _example_ ? – Avatar Sep 14 '12 at 07:36
  • Are you sure? It works for me. Here's a [jsFiddle](http://jsfiddle.net/acheong87/cqAsB/). – Andrew Cheong Sep 14 '12 at 13:26
  • I just realized that it works at my work computer (running Internet Explorer 8) but not on other browsers! No wonder you were saying it wasn't working. I'll have to revisit this at home. – Andrew Cheong Sep 17 '12 at 15:08

You can accomplish this without regex by letting the browser parse the markup. This function cleans the first top-level element of the string:

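function filter_inline_style(text){
    var temp_el = document.createElement("DIV");
    temp_el.innerHTML = text;
    var el            = temp_el.firstChild;

    // Only element nodes (nodeType 1) carry a style attribute
    if(el && el.nodeType === 1){
        var background = el.style.backgroundColor;
        var color      = el.style.color;

        // Drop every inline style, then restore only the whitelisted ones.
        // Empty values are skipped so that no empty style="" is left behind.
        el.removeAttribute('style');
        if(background) el.style.backgroundColor = background;
        if(color)      el.style.color           = color;
        return el.outerHTML;
    }

    // Plain text (or empty input): return it unchanged
    return temp_el.innerHTML;
}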

To use it:

var text = '<p style="font-size:8px;line-height:14px;color:#FF0000;background-color: red">example</p>';
var clean_text = filter_inline_style(text);
console.log(clean_text); 
// output: <p style="background-color: red; color: rgb(255, 0, 0);">example</p>
Mohamad Hamouday