Regex to extract HTML5 classes from a CSS selector string

Question

I'm reading CSS files from disk as strings.

My goal is to extract HTML classes paired with a specific data attribute like this:

.foo[data-my-attr]

The data attribute is unique enough that I don't need to traverse the CSS AST. I can simply use a regex like this:

(\.\S+)+\[data-my-attr\]

This already works, but \S+ is obviously a bad way to match an HTML class in a selector. It will include various combinators, pseudoclasses, pseudoselectors, etc.
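To make the over-matching concrete, here is a minimal sketch that runs the regex above against a few selectors. The sample strings are illustrative assumptions, not taken from a real stylesheet.

```javascript
// The regex from the question, unchanged.
const naive = /(\.\S+)+\[data-my-attr\]/;

// Hypothetical sample selectors, chosen only to show the behaviour.
const samples = [
  '.foo[data-my-attr]',
  '.foo>span[data-my-attr]',
  '.foo:hover[data-my-attr]',
];

// For each sample, record what the capture group grabbed (or null).
const naiveResults = samples.map(s => {
  const m = naive.exec(s);
  return m ? m[1] : null;
});

console.log(naiveResults);
// [ '.foo', '.foo>span', '.foo:hover' ]
```

Because \S+ accepts any non-whitespace character, the combinator and the pseudo-class end up inside the captured "class".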

I tried building a whitelist version of the regex, e.g. (\w|-)+, but the HTML5 spec for class names is very permissive. I will inevitably either miss valid characters or include invalid ones.

What regex can be used to extract HTML5 classes from a CSS selector string?

I'm using Node, i.e. the JavaScript flavor of regexes.

UPD1

Some examples:

.foo[data-my-attr] -- should match .foo
.foo>span[data-my-attr] -- should not match
.I_f%⌘ing_♥_HTML5[data-my-attr] -- should match .I_f%⌘ing_♥_HTML5

This question exists because I'm unable to think of every possible valid HTML5 class. I need a regex based on the surprisingly vague HTML5 class spec:

3.2.5.7 The class attribute

The attribute, if specified, must have a value that is a set of space-separated tokens representing the various classes that the element belongs to.

The classes that an HTML element has assigned to it consists of all the classes returned when the value of the class attribute is split on spaces. (Duplicates are ignored.)

There are no additional restrictions on the tokens authors can use in the class attribute, but authors are encouraged to use values that describe the nature of the content, rather than values that describe the desired presentation of the content.

Obviously, a class shouldn't contain spaces and characters like +>:()[]=~ because they are part of CSS selector syntax...

Whoever is voting to close the question, please explain in the comments what can be fixed to make this question valid. — Andrey Mikhaylov - lolmaus, Nov 25 '17 at 11:45
Will this https://stackoverflow.com/a/6329126/1156518 regex extended with your specific attribute work for you? — Dmitry Druganov, Nov 25 '17 at 11:55
@DmitryDruganov No, it's valid for HTML4, but will omit many HTML5-valid classes, such as `#%LV-||_⌘⌥♥{©♤₩¤☆€~¥}`. — Andrey Mikhaylov - lolmaus, Nov 25 '17 at 12:11
What is the problem? Choose a character class that excludes characters you don't want. From your description: `[^#+>:()\[\]=~\s.]` — Casimir et Hippolyte, Nov 25 '17 at 13:14
Note that `#` can't be in a class name, since it's a selector for ids. Same thing about curly brackets. — Casimir et Hippolyte, Nov 25 '17 at 13:22
You're working from the wrong spec. The relevant spec is not the HTML5 spec, but the Selectors spec, and in particular the [selectors_group](https://www.w3.org/TR/css3-selectors/#grammar) production. — Alohci, Nov 25 '17 at 14:23
why should `.I_f#%⌘ing_♥_HTML5` match? It contains a `#` which is the start of an `id` selector for the element with the id `%⌘ing_♥_HTML5`. — Patrick J. S., Nov 25 '17 at 15:24
@CasimiretHippolyte The problem is that I don't have an explicit list of exclusions. — Andrey Mikhaylov - lolmaus, Nov 25 '17 at 16:05
@PatrickJ.S. Good catch. But even though it won't match in CSS, the `I_f#%⌘ing_♥_HTML5` is still a valid HTML5 class and can be targeted with, for example, `document.getElementsByClassName("I_f#%⌘ing_♥_HTML5")`. — Andrey Mikhaylov - lolmaus, Nov 25 '17 at 16:09

Fez Vrasta · Answer 1 · 2017-11-25T16:39:53.467

You shouldn't use a regular expression.

A much more robust alternative is PostCSS and its parser. It gives you a full AST (abstract syntax tree) of the whole stylesheet, from which you can easily extract the part you are looking for.

const postcss = require('postcss');
const Tokenizer = require('css-selector-tokenizer');

let output = [];

const postcssAttributes = postcss.plugin('postcss-attributes', function() {
  return function(css) {
    css.walkRules(function(rule) {
      rule.selectors.forEach(selector => {
        const tokenized = Tokenizer.parse(selector);
        if (
          tokenized.nodes.some(({ nodes }) =>
            nodes.some(
              node =>
                node.type === 'attribute' && node.content === 'data-my-attr'
            )
          )
        ) {
          output.push(selector);
        }
      });
    });
  };
});

const css = `
    .foo[data-my-attr] {
        color: red;
    }
    .foo[something] {
        color: red;
    }
`;

postcss([postcssAttributes])
  .process(css)
  .then(result => console.log(output));

// logs: [ '.foo[data-my-attr]' ]

This will log all the matching selectors.

Thank you for your example. I've been considering using a CSS AST and decided against it for two reasons: 1. It will make my build times longer. 2. It does not solve the problem of extracting HTML classes from compound selectors, which will still require regexes. — Andrey Mikhaylov - lolmaus, Nov 25 '17 at 20:11

score 0 · Answer 2 · answered Nov 25 '17 at 16:05

The regex to match an HTML5 class in a selector string is:

/\.-?(?:[_a-z]|[\240-\377]|(?:(?:\\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?)|\\[^\r\n\f0-9a-f]))(?:[_a-z0-9-]|[\240-\377]|(?:(?:\\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?)|\\[^\r\n\f0-9a-f]))*/

Credit: @KOBA789

Thx to Alohci for pointing in the right direction.
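For readability, the same idea can be assembled from named pieces. The following is only a sketch, written under these assumptions: it follows the `ident` production of the Selectors Level 3 lexical grammar, treats every non-ASCII character as `nonascii` (`[^\x00-\x7F]`), uses non-capturing groups throughout, adds the `i` flag because the grammar is case-insensitive, and appends the `[data-my-attr]` suffix from the question. The names `classBeforeAttr` and `extractClasses` are hypothetical helpers, not part of any library.

```javascript
// Building blocks, mirroring the grammar's production names.
const nonascii = String.raw`[^\x00-\x7F]`;
const unicode = String.raw`\\[0-9a-f]{1,6}(?:\r\n|[ \n\r\t\f])?`;
const escape = String.raw`(?:${unicode}|\\[^\n\r\f0-9a-f])`;
const nmstart = String.raw`(?:[_a-z]|${nonascii}|${escape})`;
const nmchar = String.raw`(?:[_a-z0-9-]|${nonascii}|${escape})`;
const ident = String.raw`-?${nmstart}${nmchar}*`;

// A class selector immediately followed by the attribute selector.
const classBeforeAttr = new RegExp(
  String.raw`(\.${ident})\[data-my-attr\]`,
  'giu'
);

// Hypothetical helper: returns every class that directly precedes
// [data-my-attr] in the given selector string.
function extractClasses(selector) {
  return Array.from(selector.matchAll(classBeforeAttr), m => m[1]);
}

console.log(extractClasses('.foo[data-my-attr]'));      // [ '.foo' ]
console.log(extractClasses('.foo>span[data-my-attr]')); // []
```

Note that this matches classes as they are written in CSS, so a character such as % has to appear escaped in the selector (for example `.I_f\%⌘ing_♥_HTML5`) to be picked up. As the comments below point out, it will also match text inside strings or `:not(...)`.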

Andrey Mikhaylov - lolmaus

Really? What about `#notaclass:after { content:".notaclasstoo { whatever you want"; }` – Casimir et Hippolyte Nov 25 '17 at 16:12
@CasimiretHippolyte Your example is not a valid selector. – Andrey Mikhaylov - lolmaus Nov 25 '17 at 16:23
What is invalid? – Casimir et Hippolyte Nov 25 '17 at 16:44
Your code sample is a CSS rule, the question is about a CSS selector. – Andrey Mikhaylov - lolmaus Nov 25 '17 at 20:08
Yes, it's a CSS rule, but how can you be sure to extract a CSS selector, even with a pattern that describes all possible selectors or the one you want, from a string that contains quoted parts? Inside quoted parts, you can also have something that matches your pattern and that isn't a selector. – Casimir et Hippolyte Nov 25 '17 at 20:58
That's a valid concern. They don't even have to be quoted, e.g.: `.foo:not(.bar)`. Luckily, my use case doesn't suffer from non-existing classes being harvested. The important part is not to miss any existing ones. – Andrey Mikhaylov - lolmaus Nov 26 '17 at 07:44
