What would be the most efficient way to parse a CSS selector input string that features any combination of:

  • [key=value] : attributes, 0 to * instances
  • #id : ids, 0 to 1 instances
  • .class : classes, 0 to * instances
  • tagName : tag names, 0 to 1 instances (found at start of string only)

(note: '*', or other applicable combinator could be used in lieu of tag?)

Such as:

div.someClass#id[key=value][key2=value2].anotherClass

Into the following output:

['div','.someClass','#id','[key=value]','[key2=value2]','.anotherClass']

Or for bonus points, into this form efficiently (read: a way not just based on using str[0] === '#' for example):

{
 tags : ['div'],
 classes : ['someClass','anotherClass'],
 ids : ['id'],
 attrs : 
   {
     key : value,
     key2 : value2
   }
}

(note removal of # . [ = ])

I imagine some combination of regex and .match(..) is the way to go, but my regex knowledge is nowhere near advanced enough for this situation.

Many thanks for your help.

Darius
  • Regex is rarely the right solution for parsing complex languages. You should have a look at the many libraries doing this (like sizzle) – Denys Séguret Jul 26 '13 at 18:04
  • I know sizzle does it, but I'm looking to implement my own simple solution. The domain is not as complex as a language, there is no whitespace etc, and a limited format for delimiters (as listed in the question) – Darius Jul 26 '13 at 18:05
  • I was suggesting looking at the source, not using it. If you want to parse CSS selectors, you should take whitespace into account. – Denys Séguret Jul 26 '13 at 18:06
  • OK I will consult the source, but I'm talking about tokens already split by whitespace. This question is about the next step after splitting the tokens delimited by whitespace – Darius Jul 26 '13 at 18:07
  • @dystroy I think this is about parsing the selector "sub-syntax" for a single element match; I'm not sure what that's called. Also SCRIPTONITE note that it's not just splitting on whitespace - whitespace is an **operator** in the CSS selector syntax, comparable to the `+` and `~` connectors. – Pointy Jul 26 '13 at 18:07
  • Also, to further clarify, this might be implemented server side, which means it can't be based on Sizzle's use of browser native methods – Darius Jul 26 '13 at 18:08
  • @Pointy I take your point. But I think focusing on the subselectors first, before attempting more complex connections is a good approach – Darius Jul 26 '13 at 18:16
  • How do you know the order of the selectors in your second example? – Gumbo Jul 26 '13 at 18:19
  • Is it a requirement to preserve the order? The node has to match all the conditions regardless? Correct me if I'm wrong though! – Darius Jul 26 '13 at 18:23
  • You **really** need to write your requirements. If you want to implement any CSS selector (including things like `:not`), then it's not a light project that will be answered here. – Denys Séguret Jul 26 '13 at 18:23
  • @dystroy I think pseudo selectors are a future iteration. At this stage the requirements are the subset of qualifiers listed. Your solution below really helps – Darius Jul 26 '13 at 18:28
  • @Pointy: The "sub-syntax" for a single element match is known as a compound selector; its components are called simple selectors. `+`, `~` and whitespace (where it matters!) are known as combinators. A series of compound selectors and combinators is called a complex selector. The terminology is taken from [Selectors 4](http://www.w3.org/TR/selectors4); the Selectors 3 recommendation has different names but those are very confusing. I wrote a full answer here, with examples: http://stackoverflow.com/questions/9848556/correct-terms-and-words-for-sections-and-parts-of-selectors – BoltClock Jul 27 '13 at 06:52
  • And speaking of combinators, `*` is not a combinator. It's the universal selector. Although the OP is right in that `*` may substitute a type selector. – BoltClock Jul 27 '13 at 07:04
  • @BoltClock ok thanks; I went and read over the syntax content in the W3C spec. I couldn't find anything like an "official" formal grammar of any sort, though I suppose one could consider that part of the spec to be the formal grammar. – Pointy Jul 27 '13 at 13:03
  • @Pointy: If you're referring to a stable spec, see http://www.w3.org/TR/selectors Either way, that *is* the formal grammar of Selectors. – BoltClock Jul 27 '13 at 13:20
  • @BoltClock oh now I see it (#grammar). Duhh. Thanks! – Pointy Jul 27 '13 at 13:25

1 Answer

You might do the splitting using

var tokens = subselector.split(/(?=\.)|(?=#)|(?=\[)/)

which changes

div.someClass#id[key=value][key2=value2].anotherClass

to

["div", ".someClass", "#id", "[key=value]", "[key2=value2]", ".anotherClass"]

After that, you simply have to look at how each token starts (and, for tokens starting with [, check whether they contain a =).
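To see why the pattern uses lookaheads, it can help to compare it with a plain character-class split. This is only an illustrative sketch; the variable names are made up for the example, and the single lookahead `(?=[.#\[])` is an equivalent shorthand for the three alternatives above:

```javascript
var input = 'div.someClass#id[key=value]';

// A plain split consumes the delimiters, so the token types are lost.
var plain = input.split(/[.#\[]/);

// A lookahead matches the empty string just before each delimiter,
// so every token keeps its leading '.', '#' or '['.
var kept = input.split(/(?=[.#\[])/);

console.log(plain); // ["div", "someClass", "id", "key=value]"]
console.log(kept);  // ["div", ".someClass", "#id", "[key=value]"]
```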

Here's the complete working code, which builds the object you describe:

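function parse(subselector) {
  // attrs is a key/value map, as in the output format requested in the question
  var obj = {tags: [], classes: [], ids: [], attrs: {}};
  // Split just before each '.', '#' or '[' so every token keeps its prefix
  subselector.split(/(?=\.)|(?=#)|(?=\[)/).forEach(function (token) {
    switch (token[0]) {
      case '#':
        obj.ids.push(token.slice(1));
        break;
      case '.':
        obj.classes.push(token.slice(1));
        break;
      case '[':
        // Drop the surrounding brackets, then separate key from value
        var pair = token.slice(1, -1).split('=');
        obj.attrs[pair[0]] = pair[1];
        break;
      default:
        // No recognized prefix: treat the token as the tag name
        obj.tags.push(token);
        break;
    }
  });
  return obj;
}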
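
One caveat: the lookahead split also cuts inside attribute values, so a token such as [href=a.b] would be broken at the dot. A possible alternative, sketched here under the same assumptions as the question (no whitespace, no quoted values, no pseudo-classes), is to match whole tokens instead of splitting, letting a bracketed attribute consume everything up to its closing ]. The name `parseByMatch`, the regex, and the choice to store a bare [key] as `true` are illustrative assumptions, not part of the code above:

```javascript
function parseByMatch(subselector) {
  var obj = {tags: [], classes: [], ids: [], attrs: {}};
  // Alternatives, tried left to right at each position:
  //   \[[^\]]*\]     a whole [attribute], including any '.' or '#' inside it
  //   [.#][^.#\[]+   a class or an id
  //   [^.#\[]+       anything else (the tag name, or '*')
  var tokens = subselector.match(/\[[^\]]*\]|[.#][^.#\[]+|[^.#\[]+/g) || [];
  tokens.forEach(function (token) {
    switch (token[0]) {
      case '#':
        obj.ids.push(token.slice(1));
        break;
      case '.':
        obj.classes.push(token.slice(1));
        break;
      case '[':
        var body = token.slice(1, -1);
        var eq = body.indexOf('=');
        // Split on the first '=' only; a bare [key] is stored as true (assumption)
        if (eq === -1) obj.attrs[body] = true;
        else obj.attrs[body.slice(0, eq)] = body.slice(eq + 1);
        break;
      default:
        obj.tags.push(token);
    }
  });
  return obj;
}

console.log(parseByMatch('a.link[href=index.html#top]#nav'));
// { tags: ['a'], classes: ['link'], ids: ['nav'], attrs: { href: 'index.html#top' } }
```

Because the attribute alternative comes first in the pattern, it wins whenever a token starts with [, which is what keeps the '.' and '#' inside the value from starting new tokens.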

Denys Séguret
  • This is a great start, though I do agree with @Gumbo's point. Is there a way to make the attribute search 'greedier' than the other searches to avoid this problem? – Darius Jul 26 '13 at 18:15
  • @Gumbo I answered the written question, not another question about any kind of CSS selector, because trying to do that in a few lines of JavaScript would be doomed. – Denys Séguret Jul 26 '13 at 18:27