JavaScript: efficient parsing of a CSS selector

Question

What would be the most efficient way of parsing a CSS selector input string that features any combination of:

[key=value] : attributes, 0 to * instances
#id : ids, 0 to 1 instances
.class : classes, 0 to * instances
tagName : tag names, 0 to 1 instances (found at start of string only)

(note: '*', or other applicable combinator could be used in lieu of tag?)

Such as:

div.someClass#id[key=value][key2=value2].anotherClass

Into the following output:

['div','.someClass','#id','[key=value]','[key2=value2]','.anotherClass']

Or for bonus points, into this form efficiently (read: a way not just based on using str[0] === '#' for example):

{
 tags : ['div'],
 classes : ['someClass','anotherClass'],
 ids : ['id'],
 attrs : 
   {
     key : value,
     key2 : value2
   }
}

(note removal of # . [ = ])

I imagine some combination of regex and .match(..) is the way to go, but my regex knowledge is nowhere near advanced enough for this situation.

Many thanks for your help.

Regex is rarely the right solution for parsing complex languages. You should have a look at the many libraries doing this (like Sizzle) — Denys Séguret, Jul 26 '13 at 18:04
I know sizzle does it, but I'm looking to implement my own simple solution. The domain is not as complex as a language, there is no whitespace etc, and a limited format for delimiters (as listed in the question) — Darius, Jul 26 '13 at 18:05
I was suggesting that you look at the source, not that you use it. If you want to parse CSS selectors, you should take whitespace into account. — Denys Séguret, Jul 26 '13 at 18:06
OK I will consult the source, but I'm talking about tokens already split by whitespace. This question is about the next step after splitting the tokens delimited by whitespace — Darius, Jul 26 '13 at 18:07
@dystroy I think this is about parsing the selector "sub-syntax" for a single element match; I'm not sure what that's called. Also SCRIPTONITE note that it's not just splitting on whitespace - whitespace is an **operator** in the CSS selector syntax, comparable to the `+` and `~` connectors. — Pointy, Jul 26 '13 at 18:07
Also, to further clarify, this might be implemented server side, which means it can't be based on Sizzle's use of browser native methods — Darius, Jul 26 '13 at 18:08
@Pointy I take your point. But I think focusing on the subselectors first, before attempting more complex connections is a good approach — Darius, Jul 26 '13 at 18:16
How do you know the order of the selectors in your second example? — Gumbo, Jul 26 '13 at 18:19
Is it a requirement to preserve the order? The node has to match all the conditions regardless? Correct me if I'm wrong though! — Darius, Jul 26 '13 at 18:23
You **really** need to write your requirements. If you want to implement any CSS selector (including things like `:not`), then it's not a light project that will be answered here. — Denys Séguret, Jul 26 '13 at 18:23
@dystroy I think pseudo selectors are a future iteration. At this stage the requirements are the subset of qualifiers listed. Your solution below really helps — Darius, Jul 26 '13 at 18:28
@Pointy: The "sub-syntax" for a single element match is known as a compound selector; its components are called simple selectors. `+`, `~` and whitespace (where it matters!) are known as combinators. A series of compound selectors and combinators is called a complex selector. The terminology is taken from [Selectors 4](http://www.w3.org/TR/selectors4); the Selectors 3 recommendation has different names but those are very confusing. I wrote a full answer here, with examples: http://stackoverflow.com/questions/9848556/correct-terms-and-words-for-sections-and-parts-of-selectors — BoltClock, Jul 27 '13 at 06:52
And speaking of combinators, `*` is not a combinator. It's the universal selector. Although the OP is right in that `*` may substitute a type selector. — BoltClock, Jul 27 '13 at 07:04
@BoltClock ok thanks; I went and read over the syntax content in the W3C spec. I couldn't find anything like an "official" formal grammar of any sort, though I suppose one could consider that part of the spec to be the formal grammar. — Pointy, Jul 27 '13 at 13:03
@Pointy: If you're referring to a stable spec, see http://www.w3.org/TR/selectors Either way, that *is* the formal grammar of Selectors. — BoltClock, Jul 27 '13 at 13:20

Denys Séguret · Answer 1 · 2013-07-26T18:18:13.520

You might do the splitting using

var tokens = subselector.split(/(?=\.)|(?=#)|(?=\[)/)

which changes

div.someClass#id[key=value][key2=value2].anotherClass

to

["div", ".someClass", "#id", "[key=value]", "[key2=value2]", ".anotherClass"]

After that, you only have to look at how each token starts (and, for tokens starting with [, check whether they contain an =).

Here's complete working code that builds the object you describe, except that attrs is collected as an array of [key, value] pairs rather than a key/value map:

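function parse(subselector) {
  // attrs collects [key, value] pairs, e.g. [['key', 'value'], ['key2', 'value2']]
  var obj = {tags:[], classes:[], ids:[], attrs:[]};
  // Zero-width lookaheads split just before each '.', '#' or '[',
  // so every token keeps its leading delimiter.
  subselector.split(/(?=\.)|(?=#)|(?=\[)/).forEach(function(token){
    switch (token[0]) {
      case '#':
        obj.ids.push(token.slice(1));
        break;
      case '.':
        obj.classes.push(token.slice(1));
        break;
      case '[':
        // Drop the surrounding brackets, then separate key from value
        obj.attrs.push(token.slice(1,-1).split('='));
        break;
      default:
        // No leading delimiter: treat the token as the tag name
        obj.tags.push(token);
        break;
    }
  });
  return obj;
}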

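One limitation of splitting on lookaheads is that a '.' or '#' inside an attribute value also triggers a split: a[href=index.html#top].active comes out as 'a', '[href=index', '.html', '#top]', '.active'. A possible variant, sketched here under assumptions, tokenizes with match() so that each bracketed block is consumed whole, and stores attrs as a key/value object as in the question. The name parseCompound and the choice to store a bare [key] as true are hypothetical, not part of the original answer, and quoted values containing ']' are not handled.

```javascript
// Hypothetical helper (not from the original answer): tokenizes with match()
// instead of split(), so a bracketed attribute is always consumed as one token.
function parseCompound(subselector) {
  var obj = { tags: [], classes: [], ids: [], attrs: {} };
  // Alternatives, tried in this order at each position:
  //   \[[^\]]*\]     a whole [ ... ] block, whatever characters it contains
  //   [#.][^#.\[]+   '#' or '.' followed by a name
  //   [^#.\[]+       anything else (the tag name or '*')
  var tokens = subselector.match(/\[[^\]]*\]|[#.][^#.\[]+|[^#.\[]+/g) || [];
  tokens.forEach(function (token) {
    switch (token.charAt(0)) {
      case '#':
        obj.ids.push(token.slice(1));
        break;
      case '.':
        obj.classes.push(token.slice(1));
        break;
      case '[':
        var body = token.slice(1, -1);
        var eq = body.indexOf('=');
        // Assumption: a bare [key] (no '=') is stored with the value true.
        if (eq === -1) {
          obj.attrs[body] = true;
        } else {
          // Split on the first '=' only, so the value may itself contain '='.
          obj.attrs[body.slice(0, eq)] = body.slice(eq + 1);
        }
        break;
      default:
        obj.tags.push(token);
    }
  });
  return obj;
}
```

With this sketch, parseCompound('a[href=index.html#top].active') returns { tags: ['a'], classes: ['active'], ids: [], attrs: { href: 'index.html#top' } }. Combinators and pseudo-classes are still out of scope, as in the original answer.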

This is a great start, though I do agree with @Gumbo's point. Is there a way to make the attribute search 'greedier' than the other searches to avoid this problem? — Darius, Jul 26 '13 at 18:15
@Gumbo I answered the written question, not another question about any kind of CSS selector because trying to do it in a few lines of javascript would be doomed. — Denys Séguret, Jul 26 '13 at 18:27
