I'm reading CSS files from disk as strings.
My goal is to extract HTML classes paired with a specific data attribute like this:
.foo[data-my-attr]
The data attribute is unique enough that I don't need to traverse the CSS AST. I can simply use a regex like this:
(\.\S+)+\[data-my-attr\]
This already works, but \S+ is obviously a bad way to match an HTML class in a selector. It will also swallow combinators, pseudo-classes, pseudo-elements, etc.
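To make the over-matching concrete, here is a small Node sketch that runs the regex above against two selectors. The helper name naiveExtract is made up for this illustration.

```javascript
// The regex from above, unchanged.
const naive = /(\.\S+)+\[data-my-attr\]/;

// Hypothetical helper: returns the captured "class" part, or null if no match.
function naiveExtract(selector) {
  const m = selector.match(naive);
  return m ? m[1] : null;
}

console.log(naiveExtract('.foo[data-my-attr]'));      // '.foo'
console.log(naiveExtract('.foo>span[data-my-attr]')); // '.foo>span' (the combinator and tag leak in)
```

The second call shows the problem: \S+ happily consumes the > combinator and the span tag, so the capture is not a class at all.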
I tried building a whitelist version of the regex, e.g. (\w|-)+, but the HTML5 spec for class names is very permissive. Inevitably, I either miss valid characters or include invalid ones.
What regex can be used to extract HTML5 classes from a CSS selector string?
I'm using Node, i.e. the JavaScript flavor of regexes.
UPD1
Some examples:
.foo[data-my-attr]
-- should match .foo
.foo>span[data-my-attr]
-- should not match
.I_f%⌘ing_♥_HTML5[data-my-attr]
-- should match .I_f%⌘ing_♥_HTML5
This question exists because I'm unable to think of every possible valid HTML5 class. I need a regex based on the surprisingly vague HTML5 class spec:
The attribute, if specified, must have a value that is a set of space-separated tokens representing the various classes that the element belongs to.
The classes that an HTML element has assigned to it consists of all the classes returned when the value of the class attribute is split on spaces. (Duplicates are ignored.)
There are no additional restrictions on the tokens authors can use in the class attribute, but authors are encouraged to use values that describe the nature of the content, rather than values that describe the desired presentation of the content.
Obviously, a class shouldn't contain spaces and characters like +>:()[]=~ because they are part of CSS selector syntax...
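As a rough sketch of that blacklist idea (not a spec-derived answer), a negated character class can stand in for "anything that is not selector punctuation". The excluded set below is my own guess: the characters listed above plus whitespace, ".", "#", ",", "*", braces and quotes. The function name extractClasses is made up for this example.

```javascript
// Assumption: a class name ends at whitespace or at any of these
// selector-syntax characters. This list is illustrative, not taken from a spec.
const pattern = /((?:\.[^\s.#,+>~:()\[\]=*{}"']+)+)\[data-my-attr\]/g;

// Hypothetical helper: returns every class chain directly followed by [data-my-attr].
function extractClasses(css) {
  return Array.from(css.matchAll(pattern), (m) => m[1]);
}

console.log(extractClasses('.foo[data-my-attr]'));                // [ '.foo' ]
console.log(extractClasses('.foo>span[data-my-attr]'));           // []
console.log(extractClasses('.I_f%⌘ing_♥_HTML5[data-my-attr]'));   // [ '.I_f%⌘ing_♥_HTML5' ]
console.log(extractClasses('.a.b[data-my-attr] { color: red }')); // [ '.a.b' ]
```

This handles the three examples above, but it does not understand CSS escapes: a class written as .foo\:bar in the stylesheet is not matched, so it only helps if escaped class names can be ignored.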