3

I'm parsing CSS3 selectors using a regex. For example, the selector a>b,c+d is broken down into:

  Selector:
    a>b
    c+d
  SOSS:
    a
    b
    c
    d
  TypeSelector:
    a
    b
    c
    d
  Identifier:
    a
    b
    c
    d
  Combinator:
    >
    +

The problem is, for example, I don't know which selector the > combinator belongs to. The Selector Group has 2 captures (as shown above), each containing 1 combinator. I want to know what that combinator is for that capture.

Groups have lists of Captures, but Captures don't have lists of Groups found in that Capture. Is there a way around this, or should I just re-parse each selector?


Edit: Each capture does give you the index of where the match occurred though... maybe I could use that information to determine what belongs to what?


So you don't think I'm insane, the syntax is actually quite simple, using my special dict class:

var flex = new FlexDict
    {
        {"GOS"/*Group of Selectors*/, @"^\s*{Selector}(\s*,\s*{Selector})*\s*$"},
        {"Selector", @"{SOSS}(\s*{Combinator}\s*{SOSS})*{PseudoElement}?"},
        {"SOSS"/*Sequence of Simple Selectors*/, @"({TypeSelector}|{UniversalSelector}){SimpleSelector}*|{SimpleSelector}+"},
        {"SimpleSelector", @"{AttributeSelector}|{ClassSelector}|{IDSelector}|{PseudoSelector}"},

        {"TypeSelector", @"{Identifier}"},
        {"UniversalSelector", @"\*"},
        {"AttributeSelector", @"\[\s*{Identifier}(\s*{ComparisonOperator}\s*{AttributeValue})?\s*\]"},
        {"ClassSelector", @"\.{Identifier}"},
        {"IDSelector", @"#{Identifier}"},
        {"PseudoSelector", @":{Identifier}{PseudoArgs}?"},
        {"PseudoElement", @"::{Identifier}"},

        {"PseudoArgs", @"\([^)]*\)"},

        {"ComparisonOperator", @"[~^$*|]?="},
        {"Combinator", @"[ >+~]"},

        {"Identifier", @"-?[a-zA-Z\u00A0-\uFFFF_][a-zA-Z\u00A0-\uFFFF_0-9-]*"},

        {"AttributeValue", @"{Identifier}|{String}"},
        {"String", @""".*?(?<!\\)""|'.*?(?<!\\)'"},
    };
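FlexDict's implementation isn't shown here, so the following is only a guess at how such `{Name}` placeholders could be expanded: each placeholder is replaced, recursively, by a named group wrapping the referenced rule. The names `PlaceholderExpander` and `Expand` are hypothetical, and the sketch assumes the rules contain no cycles and no regex quantifiers written as `{Name}`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderExpander
{
    // Hypothetical helper, not the FlexDict class from the question.
    // Replaces every {Name} in a rule with (?<Name>...expanded rule...).
    // Assumes every referenced name exists and that rules are not cyclic;
    // a cyclic rule set would recurse forever.
    public static string Expand(IDictionary<string, string> rules, string name)
    {
        return Regex.Replace(
            rules[name],
            @"\{([A-Za-z]+)\}",
            m => "(?<" + m.Groups[1].Value + ">"
                 + Expand(rules, m.Groups[1].Value) + ")");
    }
}
```

Because .NET allows the same group name to appear more than once in a pattern, every occurrence of a rule adds its captures to the same group, which would produce flat capture lists like the ones in the question.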
mpen

3 Answers

1

You shouldn't write one regex to parse the whole thing. Instead, first get the selectors, and then get the combinator for each of them. (At least that's how you would parse your example; real CSS is going to be more complicated.)

svick
  • Why not? How am I going to know if it's valid? How would I "get the selectors"? Split on comma? What if that comma is in an attribute selector, or in quotes? Then I can't split there. I find it much easier to parse the whole thing with one regex. Real CSS is going to be more complicated? Here... let me paste the regex for you, and then you can give me an example it can't parse. – mpen May 08 '11 at 23:21
  • @Mark, I'm not saying you can't parse CSS with one regex. I'm saying it's insane to do it. If there is a mistake in it, it would be very hard to debug. And it doesn't give you the structural information you need. – svick May 08 '11 at 23:52
  • I don't think so... do you think I wrote that one big regex by hand? No. It's composed of 17 small, easy-to-read regexes. I can test each individually if there's a mistake. It is missing some of the structural information though. It labels all the parts correctly, but it doesn't give me the hierarchy I need. Which is why I'm thinking I run it once to validate the selector, and then run sub-regexes on each part to drill down and get the information I need. Probably a bit inefficient to re-run all the regexes, but... still the easiest solution I've found. – mpen May 09 '11 at 00:03
1

Each capture does give you the index of where the match occurred though... maybe I could use that information to determine what belongs to what?

Just thinking aloud here; you could pick out each match in the Selector group, get its starting and ending indices relative to the entire match and see if the index of each combinator falls within the start and end index range. If the combinator's index falls within the range, it occurs in that selector.

I'm not sure how this would fare in terms of performance though. But I think you could make it work.
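A minimal sketch of that index-range check, assuming .NET's `System.Text.RegularExpressions`, where every `Capture` exposes `Index` and `Length` relative to the input string. The pattern is a simplified stand-in for the grammar in the question (plain `\w+` names, no descendant combinator), and `CombinatorOwner` and `Assign` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CombinatorOwner
{
    // Simplified stand-in for the question's grammar: names joined by
    // '>', '+' or '~', grouped by commas. The group names are reused so
    // that each group collects one capture per occurrence.
    private static readonly Regex Pattern = new Regex(
        @"^\s*(?<Selector>\w+(?:\s*(?<Combinator>[>+~])\s*\w+)*)" +
        @"(?:\s*,\s*(?<Selector>\w+(?:\s*(?<Combinator>[>+~])\s*\w+)*))*\s*$");

    // Returns (selector text, combinators found inside it) pairs,
    // or an empty list if the input does not match.
    public static List<KeyValuePair<string, List<string>>> Assign(string input)
    {
        var result = new List<KeyValuePair<string, List<string>>>();
        Match m = Pattern.Match(input);
        if (!m.Success) return result;

        foreach (Capture sel in m.Groups["Selector"].Captures)
        {
            int start = sel.Index;
            int end = sel.Index + sel.Length; // exclusive
            var owned = new List<string>();
            foreach (Capture comb in m.Groups["Combinator"].Captures)
            {
                // A combinator belongs to this selector if its index
                // falls inside the selector's [start, end) range.
                if (comb.Index >= start && comb.Index < end)
                    owned.Add(comb.Value);
            }
            result.Add(new KeyValuePair<string, List<string>>(sel.Value, owned));
        }
        return result;
    }
}
```

For the input `a>b,c+d` this pairs `a>b` with `>` and `c+d` with `+`. The nested loop is quadratic in the number of captures, which is unlikely to matter for selectors of ordinary length.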

BoltClock
  • Yes, that's what I was thinking. And all the indices are sorted in ascending order already, so I could perform binary searches to speed things up if necessary. – mpen May 09 '11 at 00:19
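The binary-search variant mentioned in that comment might look like the sketch below. It works on plain index arrays so it stays independent of any particular pattern; `CaptureLookup` and `FindOwner` are hypothetical names, and the code assumes the selector start indices are already sorted ascending and the spans do not overlap.

```csharp
using System;

public static class CaptureLookup
{
    // starts[i] and lengths[i] describe the i-th selector capture.
    // Assumes 'starts' is sorted ascending and spans do not overlap.
    // Returns the position of the span containing 'index', or -1 if none.
    public static int FindOwner(int[] starts, int[] lengths, int index)
    {
        int pos = Array.BinarySearch(starts, index);
        // A negative result is the bitwise complement of the position of
        // the first larger element, so the candidate span is the one
        // immediately before that position.
        if (pos < 0) pos = ~pos - 1;
        if (pos < 0) return -1; // index lies before the first span
        return index < starts[pos] + lengths[pos] ? pos : -1;
    }
}
```

With the captures from `a>b,c+d` (starts 0 and 4, lengths 3 and 3), index 1 maps to span 0, index 5 maps to span 1, and index 3 (the comma) maps to no span.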
1

I wouldn't recommend using regex for parsing anything except very simple cases; a parser is almost always a better choice. Take a look at this question:

Is there a CSS parser for C#?

Community
TheLukeMcCarthy
  • I was thinking about using Irony, but it seemed overkill, plus it ignores spaces, which is bad. So... dunno. Seems like it might be difficult to find a pre-existing parser built just for this task. What I need is just a simple AST. – mpen May 09 '11 at 00:22
  • @Mark It may seem like overkill now, but simple things tend to grow and become more complex over time. Also, once you have a regex parsing something, extending or changing it can be quite painful. I'm not anti-regex; I just want to point out the pain points you may encounter. I can't recommend a CSS parser as I've never needed one. If you do use regex, it is probably better to use a few small statements rather than one big one (like @svick said). – TheLukeMcCarthy May 09 '11 at 16:16
  • Well, as my update shows, it actually is broken down into pretty small patterns. I see your point though. I don't think the CSS3 spec is going to change too much though, so I think I'll chance it. I did try parsing it manually without regexes before, but it was really painful and didn't work all that well. – mpen May 09 '11 at 16:33
  • @Mark I don't think CSS3 will change that much either, but your usage of it almost certainly will. Regex will be better than starting from scratch because you should end up with fewer lines of code. – TheLukeMcCarthy May 10 '11 at 08:59
  • Everything changes though. The question is whether or not it's maintainable. This looks pretty clean and maintainable to me. – mpen May 10 '11 at 15:31