
I'm reading CSS files from disk as strings.

My goal is to extract HTML classes paired with a specific data attribute like this:

.foo[data-my-attr] 

The data attribute is unique enough that I don't need to traverse the CSS AST. I can simply use a regex like this:

(\.\S+)+\[data-my-attr\]

This already works, but \S+ is obviously a bad way to match an HTML class in a selector. It also swallows combinators, pseudo-classes, pseudo-elements, etc.
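To make the over-matching concrete, here is a minimal sketch that runs the pattern above over a small stylesheet. The sample CSS and the variable names are made up for illustration; only the regex comes from the question (with the g flag added so that all matches are returned).

```javascript
// The permissive pattern from the question, made global so every match is returned.
const naive = /(\.\S+)+\[data-my-attr\]/g;

// Hypothetical sample stylesheet, used only for illustration.
const sampleCss = `
.foo[data-my-attr] { color: red; }
.foo>span[data-my-attr] { color: blue; }
.foo:hover[data-my-attr] { color: green; }
`;

// Collect the first capture group of every match.
const naiveMatches = Array.from(sampleCss.matchAll(naive), m => m[1]);

console.log(naiveMatches);
// [ '.foo', '.foo>span', '.foo:hover' ]
```

Only the first entry is a plain class; the other two include a combinator and a pseudo-class.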

I tried building a whitelist version of the regex, e.g. (\w|-)+, but the HTML5 spec for class names is very permissive, so I will inevitably either miss valid characters or include invalid ones.

What regex can be used to extract HTML5 classes from a CSS selector string?

I'm using Node, i.e. the JavaScript flavor of regexes.

UPD1

Some examples:

  • .foo[data-my-attr] -- should match .foo
  • .foo>span[data-my-attr] -- should not match
  • .I_f%⌘ing_♥_HTML5[data-my-attr] -- should match .I_f%⌘ing_♥_HTML5

This question exists because I'm unable to think of every possible valid HTML5 class. I need a regex based on the surprisingly vague HTML5 class spec:

3.2.5.7 The class attribute

The attribute, if specified, must have a value that is a set of space-separated tokens representing the various classes that the element belongs to.

The classes that an HTML element has assigned to it consists of all the classes returned when the value of the class attribute is split on spaces. (Duplicates are ignored.)

There are no additional restrictions on the tokens authors can use in the class attribute, but authors are encouraged to use values that describe the nature of the content, rather than values that describe the desired presentation of the content.

Obviously, a class shouldn't contain spaces and characters like +>:()[]=~ because they are part of CSS selector syntax...
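As far as I understand CSS escaping, such characters can still appear in a class name if they are backslash-escaped in the stylesheet, so the class as written in a selector may differ from the token in the HTML class attribute. The sketch below reverses that escaping. The helper name unescapeCssIdent is hypothetical, and the handling of out-of-range code points is a simplification rather than a complete implementation of the CSS rules.

```javascript
// Hypothetical helper: turn a class as written in a selector (without the
// leading dot) back into the token used in the HTML class attribute.
function unescapeCssIdent(ident) {
  // Either a backslash + 1-6 hex digits + one optional whitespace,
  // or a backslash + any single character that is not a hex digit or newline.
  const escapePattern = /\\(?:([0-9a-f]{1,6})(?:\r\n|[ \t\r\n\f])?|([^\r\n\f0-9a-f]))/gi;
  return ident.replace(escapePattern, (match, hex, literal) => {
    if (hex === undefined) return literal;
    const codePoint = parseInt(hex, 16);
    // Simplification: map NUL and values beyond Unicode to U+FFFD.
    return codePoint === 0 || codePoint > 0x10ffff
      ? '\uFFFD'
      : String.fromCodePoint(codePoint);
  });
}

console.log(unescapeCssIdent('sm\\:hover')); // sm:hover
console.log(unescapeCssIdent('I_f\\%ing'));  // I_f%ing
console.log(unescapeCssIdent('\\31 0px'));   // 10px
```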

Andrey Mikhaylov - lolmaus

2 Answers

2

You shouldn't use a regular expression.

A much more robust alternative is PostCSS (and its parser). It gives you a full AST (abstract syntax tree) of the whole stylesheet, from which you can easily extract the part you are looking for.

const postcss = require('postcss');
const Tokenizer = require('css-selector-tokenizer');

let output = [];

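// Note: postcss.plugin() is the PostCSS 7 API; it is deprecated in PostCSS 8.
const postcssAttributes = postcss.plugin('postcss-attributes', function() {
  return function(css) {
    css.walkRules(function(rule) {
      // rule.selectors holds each comma-separated selector of the rule.
      rule.selectors.forEach(selector => {
        const tokenized = Tokenizer.parse(selector);
        // Keep the selector if any of its nodes is a [data-my-attr] attribute.
        if (
          tokenized.nodes.some(({ nodes }) =>
            nodes.some(
              node =>
                node.type === 'attribute' && node.content === 'data-my-attr'
            )
          )
        ) {
          output.push(selector);
        }
      });
    });
  };
});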

const css = `
    .foo[data-my-attr] {
        color: red;
    }
    .foo[something] {
        color: red;
    }
`;

postcss([postcssAttributes])
  .process(css)
  .then(result => console.log(output));

// logs: [ '.foo[data-my-attr]' ]

This will log all the matching selectors.

Fez Vrasta
  • Thank you for your example. I've been considering using a CSS AST and decided against it for two reasons: 1. It will make my build times longer. 2. It does not solve the problem of extracting HTML classes from compound selectors, which will still require regexes. – Andrey Mikhaylov - lolmaus Nov 25 '17 at 20:11
  • My example does support compound selectors – Fez Vrasta Nov 26 '17 at 10:05
0

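The regex to match an HTML5 class in a selector string follows the CSS 2.1 grammar for identifiers (a dot followed by an ident token):

/\.-?(?:[_a-z]|[\240-\377]|(?:(?:\\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?)|\\[^\r\n\f0-9a-f]))(?:[_a-z0-9-]|[\240-\377]|(?:(?:\\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?)|\\[^\r\n\f0-9a-f]))*/i

The i flag is required because that grammar is case-insensitive; without it, a class starting with an uppercase letter is not matched.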

Credit: @KOBA789

Thanks to Alohci for pointing me in the right direction.
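For readability, the same idea can be sketched by assembling the pattern from the named pieces of the CSS 2.1 lexical grammar (nonascii, unicode, escape, nmstart, nmchar, ident). This is a sketch under assumptions, not the credited regex itself: the nonascii range is widened to everything above U+009F, because in JavaScript [\240-\377] only covers U+00A0 to U+00FF and would miss characters such as ⌘ and ♥. The names classBeforeAttr and extractClasses are hypothetical.

```javascript
// Building blocks transcribed from the CSS 2.1 lexical grammar.
// Assumption: nonascii is widened to every code unit above U+009F.
const nonascii = String.raw`[^\x00-\x9f]`;
const unicode = String.raw`\\[0-9a-f]{1,6}(?:\r\n|[ \t\r\n\f])?`;
const escapeSeq = String.raw`(?:${unicode}|\\[^\r\n\f0-9a-f])`;
const nmstart = String.raw`(?:[_a-z]|${nonascii}|${escapeSeq})`;
const nmchar = String.raw`(?:[_a-z0-9-]|${nonascii}|${escapeSeq})`;
const ident = String.raw`-?${nmstart}${nmchar}*`;

// A class selector immediately followed by [data-my-attr].
// The grammar is case-insensitive, hence the i flag.
const classBeforeAttr = new RegExp(String.raw`(\.${ident})\[data-my-attr\]`, 'gi');

// Hypothetical helper: return every class directly attached to the attribute.
function extractClasses(css) {
  return Array.from(css.matchAll(classBeforeAttr), m => m[1]);
}

console.log(extractClasses('.foo[data-my-attr] {}'));      // [ '.foo' ]
console.log(extractClasses('.foo>span[data-my-attr] {}')); // []
console.log(extractClasses('.I_f\\%⌘ing_♥_HTML5[data-my-attr] {}'));
// [ '.I_f\\%⌘ing_♥_HTML5' ]
```

Note that the last example escapes the percent sign. As far as I can tell from the grammar, an unescaped % is not part of an identifier, so .I_f%⌘ing_♥_HTML5 written without the backslash is not matched.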

Andrey Mikhaylov - lolmaus