Regex to match valid values for html style attribute

Question

I'm stripping out all style attributes from some html. I could use the regex

/style=("[^"]*"|'[^']*')/

But I wonder if this is inefficient (due to the negative matching). I also know it's vulnerable to style attributes (e.g. background-image) that can contain quotes.

Is there a regex I can use to match valid style strings or, like parsing html with regex, is this a task too difficult for a regex to perform in general?

Edit: Here is (I think) the trickiest style string in the html I'm scraping:

style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: 'Verdana','sans-serif'; mso-fareast-font-family: 'Times New Roman'"

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Sibster, Apr 17 '12 at 10:39
@Sibster I'm aware of that question & answer, but my question is a lot narrower than that — wheresrhys, Apr 17 '12 at 10:44
@wheresrhys You can also have attributes w/o quotes: `style=font-weight:bold` is valid. — Boldewyn, Apr 17 '12 at 11:13
@Boldewyn If it were up to me there wouldn't be any style attributes at all... unfortunately though, I'm having to scrape the html from a third party so have no control over whether or not the quotes are there — wheresrhys, Apr 17 '12 at 22:07

score 2 · Accepted Answer · answered Apr 17 '12 at 11:19

I don't think negative matching is slow in every case. After all, once you provide the starting point with style=, the following bytes have to be compared to the pattern anyway.

You must, however, cater for the case where attribute values are not enclosed in quotes.

/style=(".*?"|'.*?'|[^"'][^\s]*)/s

should match all productions of HTML attribute syntax. However, make sure that the dot matches all characters, including newlines (hence the /s), in your regex engine. I also used the non-greedy quantifier *?, which may not be implemented in every engine either.

There is also the special case of style= without any following value, which is not represented above to keep the pattern simpler.
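As a rough illustration, here is how the pattern above might be applied in JavaScript to strip the attribute from whole tags. This is a sketch under a few assumptions that are not part of the answer: the `g` and `i` flags are added, a leading `\s+` swallows the whitespace before the attribute, and the unquoted branch also stops at `>` so it cannot run past the end of the tag. The name `stripStyleAttributes` is made up for the example.

```javascript
// Sketch only: the answer's pattern with the assumed changes described above.
//   \s+      whitespace before the attribute (assumption)
//   [^\s>]*  unquoted value stops at whitespace or ">" (assumption)
//   flags    g = all occurrences, i = any case, s = dot matches newlines
const STYLE_ATTR = /\s+style=(".*?"|'.*?'|[^"'\s>][^\s>]*)/gis;

// Hypothetical helper: remove every style attribute from an HTML string.
function stripStyleAttributes(html) {
  return html.replace(STYLE_ATTR, "");
}

const sample =
  `<p style="FONT-SIZE: 10pt; FONT-FAMILY: 'Verdana','sans-serif'">text</p>`;
console.log(stripStyleAttributes(sample)); // <p>text</p>
console.log(stripStyleAttributes("<b style=font-weight:bold>x</b>")); // <b>x</b>
```

A double-quoted value that itself contains a double quote would still end the match early, so this only covers the well-formed cases discussed above.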

Michael Sazonov · Answer 2 · 2012-04-17T11:17:10.323

0

Try / style\=[\"\']?([a-zA-Z0-9 \:\-\#\(\)\.\_\/\;\'\,]+)\;?[\"\']? /ig

It is supposed to find every style attribute I know of.

http://jsfiddle.net/DULyx/3/ - check here

edited Apr 17 '12 at 11:17

answered Apr 17 '12 at 10:39


`url`s might be quoted though. – Christoph Apr 17 '12 at 10:51
Good effort, but it fails on `style='FONT-FAMILY: "Verdana"'`. In general I think a regex would have to be of the form `/("[allvalidchars and ']+"|'[allvalidchars and "]+')/` to avoid this pitfall, which is very irritating as it means either a) duplicating the character class or b) storing it as a string elsewhere and having to escape things properly before concatenating and passing into `new RegExp()`. And even then it's vulnerable to e.g. `style='FONT-FAMILY: \'Verdana\''`. – wheresrhys Apr 17 '12 at 22:16
According to the cases you suggest, there is no regexp that can do that. Since you want to define a rule for searching, that rule must be obeyed by the CSS writer. Once the markup doesn't follow a rule, how can you search through it? – Michael Sazonov Apr 17 '12 at 23:43
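For what it's worth, option b) from the comment above could look roughly like this. It is only a sketch: the character class in `COMMON_CHARS` is a hypothetical one loosely based on the answer's class, and the names `QUOTED_STYLE` and `stripQuotedStyles` are invented for the example.

```javascript
// Hypothetical shared character class body: characters assumed to be valid
// inside a style value, with neither quote character included.
const COMMON_CHARS = "a-zA-Z0-9 :\\-#().,_/;%";

// Build each branch from the shared string: a double-quoted value may also
// contain ', and a single-quoted value may also contain ".
const doubleQuoted = '"[' + COMMON_CHARS + "']*\"";
const singleQuoted = "'[" + COMMON_CHARS + "\"]*'";
const QUOTED_STYLE = new RegExp(
  "\\s+style=(" + doubleQuoted + "|" + singleQuoted + ")",
  "gi"
);

// Remove every quoted style attribute matched by the composed pattern.
function stripQuotedStyles(html) {
  return html.replace(QUOTED_STYLE, "");
}

console.log(stripQuotedStyles(`<p style='FONT-FAMILY: "Verdana"'>x</p>`)); // <p>x</p>
```

The backslash-escaped case (`\'Verdana\'`) is still not matched by this, as the comment already points out.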

score 0 · Answer 3 · answered Dec 01 '12 at 02:53


You shouldn't be processing HTML as a string. All you need in JS is `elt.style = ''` (or `elt.removeAttribute('style')`) for each element. If you have the chance to run your input through XSLT, it's a one-liner.
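A minimal sketch of the DOM-based approach, assuming the markup has already been parsed into a document or element. The function name `removeInlineStyles` and the variable `scrapedHtml` are hypothetical; `querySelectorAll`, `removeAttribute` and `DOMParser` are the standard DOM APIs.

```javascript
// Remove the style attribute from every element under `root`.
// `root` is assumed to be a Document or Element (anything that
// provides querySelectorAll).
function removeInlineStyles(root) {
  const styled = Array.from(root.querySelectorAll("[style]"));
  for (const el of styled) {
    el.removeAttribute("style");
  }
  return styled.length; // number of elements that were cleaned
}

// Browser usage (not executed here): parse the scraped markup, clean it,
// then serialize it again. `scrapedHtml` is a placeholder for your input.
// const doc = new DOMParser().parseFromString(scrapedHtml, "text/html");
// removeInlineStyles(doc);
// const cleaned = doc.body.innerHTML;
```

Because the parser deals with quoting, the quoted, unquoted and nested-quote cases from the question never have to be handled by hand.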


Tegra Detra · Answer 4 · 2014-03-18T15:47:50.033

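// Strip leading and trailing whitespace.
function trim(str) {
    return str.replace(/^\s+/, '').replace(/\s+$/, '');
}

// Parse an element's inline style attribute into a {property: value} object.
function getStyle(element) {
    // getAttribute returns null when the attribute is absent
    return parseRules(element.getAttribute('style') || '');
}

// Note: splitting on ';' still breaks values that contain a semicolon,
// e.g. url(data:...;base64,...).
function parseRules(rules) {
    var parsed_rules = {};
    rules.split(';').map(function (rule) {
        // Split on the first colon only, so values like url(http://...) survive
        var index = rule.indexOf(':');
        if (index === -1) return [];
        // HERE YOU CAN TRY TO CLEAN THE RULES
        return [trim(rule.slice(0, index)), trim(rule.slice(index + 1))];
    }).filter(function (rule) {
        // HERE YOU CAN TEST THAT THE RULE IS VALID
        // (changed from || to &&: both property and value must be non-empty)
        return rule.length == 2 && rule[0] != "" && rule[1] != "";
    }).forEach(function (rule) {
        parsed_rules[rule[0]] = rule[1];
    });
    return parsed_rules;
}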
