regex to parse CSS from HTML string fails when child combinator is used

Question

I'm using the following regex in my JavaScript code to parse CSS (style tags and their contents) from a larger string of HTML code:

const regex = /<style[^>]*>([^>]*?)<\/style>/g

This works fine, unless the CSS code contained within the style tags includes a CSS child combinator selector (a CSS selector like div > a for example). I imagine this has something to do with the fact that this particular selector uses > which is also syntax used to create the actual <style> tags in HTML, but I don't understand regex well enough to know if there's a way around this?

const str1 = '<style> div { color: red; } a { color: green; } </style>hello<div></div><div><a>hello</a></div>'
const str2 = '<style> div { color: red; } div > a { color: green; } </style>hello<div></div><div><a>hello</a></div>'

const regex = /<style[^>]*>([^>]*?)<\/style>/g

const matches1 = str1.match(regex) // returns a match
const matches2 = str2.match(regex) // does NOT return a match

here's a fiddle

is there a way to modify the regex so that it also works when the CSS code contains a >?

UPDATE

A clarification based on discussion in the comments

In my particular case, I'm approaching this challenge via regex (rather than, say, parsing a DOM via document.querySelectorAll('style')) because the code needs to run in different contexts. The JS runtime is found in various places these days, from browsers to Node to the Adobe suite, so I was looking for a context-agnostic solution.

At the moment, @Maxt8r's solution of changing the content expression to [\S\s]*? seems to have worked.
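A minimal sketch of that change, using a string like str2 from above. The helper name extractStyles is hypothetical (it is not from the original post), and the loop uses exec rather than matchAll on the assumption that some target runtimes may be older:

```javascript
// Hypothetical helper: returns the CSS text of every <style> block in an HTML string.
function extractStyles(html) {
  // [\S\s]*? lazily matches any character (including ">" and line breaks)
  // up to the nearest closing </style> tag.
  const styleRegex = /<style[^>]*>([\S\s]*?)<\/style>/g
  const contents = []
  let match
  while ((match = styleRegex.exec(html)) !== null) {
    contents.push(match[1]) // group 1 is the text between the tags
  }
  return contents
}

const sample = '<style> div { color: red; } div > a { color: green; } </style>hello<div></div>'
console.log(extractStyles(sample))
```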

You should change the content expression to `[\S\s]*?` since the closing style tag will be the very next html tag. Invisible content like that requires a closing tag. — , Oct 16 '20 at 21:29
Why not grab the element using `document.querySelector("style")`? The browser has already done the parsing work for you using much more sophisticated techniques than regex, so it's a good idea to capitalize on all their hard work by using the provided API instead of attempting to [parse HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) in an ad-hoc manner. — ggorlen, Oct 16 '20 at 21:30
@ggorlen in my particular use case i'm receiving a string of HTML, i do not have access to a DOM — Nick Briz, Oct 16 '20 at 21:31
You can parse it by creating an element and using `el.innerHTML = yourHTMLString`, then query away. This eliminates many edge cases in the long run. — ggorlen, Oct 16 '20 at 21:32
@ggorlen right, but creating an element assumes i have access to a DOM (ie. document.createElement()) ...i had considered using some libraries that would make that possible outside the browser... but, figured if i could get this regex to work then no other dependencies would be necessary — Nick Briz, Oct 16 '20 at 21:36

score 2 · Accepted Answer · 2020-10-16T21:56:21.983

The safer regex is this

/(?:<(style)(?:\s+(?=((?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+))\2)?\s*>)([\S\s]*?)<\/\1\s*>/

https://regex101.com/r/sx2YPf/1

and I recommend using this. The content is in group 3.
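As a rough usage sketch, the pattern above can be applied with exec and the CSS read from group 3. The helper name firstStyleContent and the sample strings are illustrative assumptions, not part of the original answer:

```javascript
// The pattern from this answer, copied verbatim; group 3 holds the CSS text.
const safeStyleRegex = /(?:<(style)(?:\s+(?=((?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+))\2)?\s*>)([\S\s]*?)<\/\1\s*>/

// Hypothetical helper: returns the CSS of the first <style> block, or null if none is found.
function firstStyleContent(html) {
  const match = safeStyleRegex.exec(html)
  return match === null ? null : match[3]
}

console.log(firstStyleContent('<style> div > a { color: green; } </style><div></div>'))
console.log(firstStyleContent('<style type="text/css">a > b {}</style>'))
```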

For reading

 (?:
    <
    ( style )           # (1), Invisible content; end tag req'd
    (?:
       \s+ 
       (?=
          (                   # (2 start)
             (?:
                " [\S\s]*? "
              | ' [\S\s]*? '
              | (?:
                   (?! /> )
                   [^>] 
                )?
             )+
          )                   # (2 end)
       )
       \2 
    )?
    \s* >
 )
 ( [\S\s]*? )        # (3)
 </ \1 \s* >

If anybody is curious, the lookahead assertion that matches the rest of the
style tag's attributes and values not only validates them, but also
ensures the style tag is not self-closing (even if that is just a typo).
The contents of the assertion are captured without being consumed, and
nothing can backtrack into the assertion once it has succeeded. The
captured text is then consumed just past the assertion by the
backreference \2, which at that point behaves like a literal.
In non-JS environments such as PHP, the same effect is achieved by using
an atomic group (?>...) instead of the assertion.
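The lookahead-plus-backreference trick can be seen in isolation with a toy pattern. This is only an illustrative sketch; the two patterns and the input 'aaab' are made up for the demonstration and are not part of the answer's regex:

```javascript
// Plain quantifier: a+ first takes "aaa", then gives one "a" back
// so that the trailing "ab" can still match.
const backtracking = /^a+ab/

// Lookahead + backreference: group 1 is fixed to "aaa" once the lookahead
// succeeds, and \1 must match exactly that text, so nothing can be given back.
const atomicLike = /^(?=(a+))\1ab/

console.log(backtracking.test('aaab')) // true
console.log(atomicLike.test('aaab'))   // false
```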

Regular expressions remain a mysterious dark art...but up-voted regardless (since it seems to work) — David Thomas, Oct 16 '20 at 23:50

score 0 · Answer 2 · answered Oct 16 '20 at 23:46

This part ([^>]*?) of the regex expression tells it that anything in between

<style ...> and </style> is not allowed to contain the > character.

[^>] means NOT > so ANY > messes up your selection and you get the following results running in Node v14.13.1:

[Running] /usr/bin/env node "...scratch/regex_test.js"

const regex = /<style[^>]*>([^>]*?)<\/style>/g

console.log(str1.match(regex))
console.log(str2.match(regex))

// output

[ '<style> div { color: red; } a { color: green; } </style>' ]
null // <-- NOT MATCHING!!!!!

The end of your regex already has a well-defined endpoint, the <\/style> part. Because that end marker is more than just a '>', one solution is to keep it as you have it and remove the restriction on the inner content.

So try replacing the [^>] with a simple . (which matches any character except a line break), and you get this output:

[Running] /usr/bin/env node "...scratch/regex_test.js"

const regex = /<style[^>]*>(.*?)<\/style>/g // notice the (.*?) part

console.log(str1.match(regex))
console.log(str2.match(regex))

// output

[ '<style> div { color: red; } a { color: green; } </style>' ]
[ '<style> div { color: red; } div > a { color: green; } </style>' ]
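One caveat worth checking in your own data: in JavaScript, . does not match line breaks unless the s (dotAll) flag is set, so CSS that spans several lines would not be matched by the . version. The sketch below compares it with the [\S\s] form mentioned in the question's update; the multi-line sample string is an assumption made up for illustration:

```javascript
// Made-up sample: the same kind of CSS, but spread over several lines.
const multiline = '<style>\n  div > a { color: green; }\n</style><div></div>'

const dotRegex = /<style[^>]*>(.*?)<\/style>/g          // . stops at line breaks
const anyCharRegex = /<style[^>]*>([\S\s]*?)<\/style>/g // [\S\s] matches every character

console.log(multiline.match(dotRegex))     // null
console.log(multiline.match(anyCharRegex)) // one match covering the whole style block
```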

I'm not sure why you didn't use the standard 'greedy' quantifier * instead of the 'lazy' quantifier *?. If you don't need it then don't use it. Here is a cheat sheet about quantifiers.
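One situation where the choice of quantifier does change the result is an HTML string with more than one style block. The following sketch uses a made-up input to compare the two; whether this applies depends on the HTML you expect to receive:

```javascript
// Made-up sample with two separate style blocks.
const twoBlocks = '<style>a { color: red; }</style><p>text</p><style>b { color: blue; }</style>'

const lazy = /<style[^>]*>(.*?)<\/style>/g  // stops at the nearest </style>
const greedy = /<style[^>]*>(.*)<\/style>/g // runs on to the last </style>

console.log(twoBlocks.match(lazy).length)   // 2
console.log(twoBlocks.match(greedy).length) // 1
```

With the greedy form, the single match spans both blocks and the markup between them.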

A copy of the code is in this gist:

https://gist.github.com/skeptycal/32e7228e3b4cad8829a8958e871512e8
