I'm stripping out all style attributes from some html. I could use the regex

/style=("[^"]*"|'[^']*')/

But I wonder if this is inefficient (due to the negative matching). I also know it's vulnerable to style attributes (e.g. background-image) that can contain quotes.

Is there a regex I can use to match valid style strings or, like parsing html with regex, is this a task too difficult for a regex to perform in general?

Edit: Here is (I think) the trickiest style string in the HTML I'm scraping:

style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: 'Verdana','sans-serif'; mso-fareast-font-family: 'Times New Roman'"
wheresrhys

4 Answers

I don't think negative matching is slow in every case. After all, once you anchor the match with style=, the following bytes have to be compared against the pattern anyway.

You must, however, handle the case where attribute values are not enclosed in quotes.

/style=(".*?"|'.*?'|[^"'][^\s]*)/s

should match all productions of HTML attribute value syntax. However, make sure that the dot matches all characters, including newlines (hence the `/s` flag), in your regex engine. I also used the non-greedy quantifier `*?`, which some engines may not implement either.

The special case of style= without any following value is not covered above, to keep the pattern simple.
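
A minimal sketch of how a pattern like the one above might be used to strip the attributes in JavaScript. The helper name `stripStyleAttributes` is hypothetical, and the pattern is an adaptation rather than the exact regex above: `[\s\S]` stands in for the dot with `/s` (which older JavaScript engines lack), leading whitespace is required so that `data-style` is left alone, and the unquoted form stops at `>`.

```javascript
// Hypothetical helper; the pattern is adapted from the answer's regex.
// \s+ before "style" avoids matching attributes such as data-style.
// [^\s"'>]+ ends an unquoted value at whitespace or the closing ">".
var STYLE_ATTR = /\s+style\s*=\s*("[\s\S]*?"|'[\s\S]*?'|[^\s"'>]+)/gi;

function stripStyleAttributes(html) {
    return html.replace(STYLE_ATTR, '');
}

var sample = '<p style="FONT-SIZE: 10pt; COLOR: black; ' +
    "FONT-FAMILY: 'Verdana','sans-serif'; " +
    "mso-fareast-font-family: 'Times New Roman'\">Hi</p>";

console.log(stripStyleAttributes(sample)); // <p>Hi</p>
```

Because this works on the raw string, it will also remove text that merely looks like a style attribute outside of a tag, and it does not handle a bare `style=` with no value.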

Boldewyn

Try / style\=[\"\']?([a-zA-Z0-9 \:\-\#\(\)\.\_\/\;\'\,]+)\;?[\"\']? /ig

It is supposed to match every style attribute I know of.

You can check it here: http://jsfiddle.net/DULyx/3/

Michael Sazonov
  • `url`s might be quoted though. – Christoph Apr 17 '12 at 10:51
  • Good effort, but it fails on `style='FONT-FAMILY: "Verdana"'`. In general I think a regex would have to be of the form `/("[allvalidchars and ']+"|'[allvalidchars and "]+')/` to avoid this pitfall, which is very irritating as it means either a) duplicating the character class or b) storing it as a string elsewhere and having to escape things properly before concatenating and passing into `new RegExp()`. And even then it's vulnerable to e.g. `style='FONT-FAMILY: \'Verdana\''`. – wheresrhys Apr 17 '12 at 22:16
  • Given the cases you suggest, there is no regexp that can do that. If you want to define a rule for searching, the rule must be obeyed by whoever writes the CSS. Once the markup doesn't follow the rule, how can you search through it? – Michael Sazonov Apr 17 '12 at 23:43
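
Option b) from wheresrhys's comment above, keeping the shared character class in one string and building the pattern with `new RegExp()`, could be sketched roughly as follows. The names `COMMON` and `matchStyle` and the exact set of allowed characters are assumptions for illustration only.

```javascript
// Characters allowed in a value, excluding both quote types.
// This set is an illustrative assumption, not a complete list.
// The doubled backslash yields \- inside the character class.
var COMMON = "a-zA-Z0-9 :\\-#().,_/;";

// A double-quoted value may contain single quotes, and vice versa.
var doubleQuoted = '"[' + COMMON + "']*" + '"';
var singleQuoted = "'[" + COMMON + '"]*' + "'";

var STYLE_VALUE = new RegExp('style=(' + doubleQuoted + '|' + singleQuoted + ')', 'i');

// Returns the matched attribute text, or null when nothing matches.
function matchStyle(html) {
    var m = html.match(STYLE_VALUE);
    return m ? m[0] : null;
}

console.log(matchStyle("<p style='FONT-FAMILY: \"Verdana\"'>"));
```

As the comment notes, this still does not cope with backslash-escaped quotes inside the value.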

You shouldn't be processing HTML as a string. In JavaScript, all you need is `elt.style = '';`. If you have the chance to run your markup through XSLT, it's a one-liner.
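
A rough sketch of the DOM route, assuming the markup is already available as a document or element. The helper name `stripInlineStyles` is hypothetical, and it uses `removeAttribute('style')` instead of assigning an empty string, which is a substitution on my part so that no empty `style` attribute is left behind.

```javascript
// Hypothetical helper: remove the style attribute from every element
// under `root` that has one. `root` is assumed to be a Document or
// Element (anything that provides querySelectorAll).
function stripInlineStyles(root) {
    var styled = root.querySelectorAll('[style]');
    Array.prototype.forEach.call(styled, function (el) {
        el.removeAttribute('style');
    });
    return styled.length; // number of elements cleaned
}

// In a browser, clean the current page.
if (typeof document !== 'undefined') {
    stripInlineStyles(document);
}
```

For HTML held in a string, a browser's `DOMParser` can produce a document to pass in as `root`.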

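// Strip leading and trailing whitespace.
function trim(str) {
    return str.replace(/^\s+/, '').replace(/\s+$/, '');
}

// Parse an element's inline style into a { property: value } object.
function getStyle(element) {
    // getAttribute returns null when the attribute is absent
    return parseRules(element.getAttribute('style') || '');
}

function parseRules(rules) {
    var parsed_rules = {};
    rules.split(';').map(function (rule) {
        // Split on the first colon only, so values such as url(http://...) stay intact
        var index = rule.indexOf(':');
        if (index === -1) { return [trim(rule)]; }
        // HERE YOU CAN TRY TO CLEAN THE RULES
        return [trim(rule.slice(0, index)), trim(rule.slice(index + 1))];
    }).filter(function (rule) {
        // HERE YOU CAN TEST THAT THE RULE IS VALID
        return rule.length === 2 && rule[0] !== '' && rule[1] !== '';
    }).forEach(function (rule) {
        parsed_rules[rule[0]] = rule[1];
    });

    return parsed_rules;
}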
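
One caveat with splitting on `;` is that a semicolon can legitimately appear inside a quoted string or a `url(...)` value. A possible quote-aware splitter is sketched below; the name `splitDeclarations` is hypothetical, and the sketch assumes CSS backslash escapes such as `\'` do not occur in the input.

```javascript
// Hypothetical helper: split a style string into declarations, ignoring
// semicolons inside quotes or parentheses. Backslash escapes are NOT
// handled (an assumption about the input).
function splitDeclarations(style) {
    var parts = [];
    var current = '';
    var quote = null; // the quote character we are inside, if any
    var depth = 0;    // nesting depth of parentheses, e.g. url(...)

    for (var i = 0; i < style.length; i++) {
        var ch = style.charAt(i);
        if (quote) {
            if (ch === quote) { quote = null; }
        } else if (ch === '"' || ch === "'") {
            quote = ch;
        } else if (ch === '(') {
            depth++;
        } else if (ch === ')' && depth > 0) {
            depth--;
        } else if (ch === ';' && depth === 0) {
            // A top-level semicolon ends the current declaration.
            parts.push(current);
            current = '';
            continue;
        }
        current += ch;
    }
    parts.push(current);

    // Drop empty pieces such as the one after a trailing semicolon.
    return parts
        .map(function (p) { return p.trim(); })
        .filter(function (p) { return p !== ''; });
}

console.log(splitDeclarations(
    "color: black; background: url(data:image/png;base64,AAAA); font-family: 'a;b';"
));
```

Each returned piece can then be split on its first colon to get the property and value.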
Tegra Detra