
I'm trying to parse CSS, or at least the basics, using ANTLR. I'm running into a few problems with my lexer rules, though. The problem lies in the ambiguity between ID selectors and hexadecimal color values. Using a simplified grammar for clarity, consider the following input:

#bbb {
  color: #fff;
}

and the following parser rules:

ruleset : selector '{' property* '}';
selector: '#' ALPHANUM;
property: ALPHANUM ':' value ';' ;
value: COLOR;

and these lexer tokens:

ALPHANUM : ('a'..'z' | '0'..'9')+;
COLOR : '#' ('0'..'9' | 'a'..'f')+;

This will not work, because #bbb is tokenized as a COLOR token, even though it's supposed to be a selector. If I change the selector to not start with a hexadecimal character, it works fine. I'm not sure how to solve this. Is there a way to tell ANTLR to treat a specific token only as a COLOR token if it's in a certain position? Say, if it's in a property rule, I can safely assume it's a color token. If it isn't, treat it as a selector.

Any help would be appreciated!


Solution: Turns out I was trying to do too much in the grammar, which I should probably deal with in the code using the AST. CSS has too many ambiguous tokens to reliably split into different tokens, so the approach I'm using now is basically tokenizing the special characters like '#', '.', ':' and the curly braces, and doing post processing in the consumer code. Works a lot better, and it's easier to deal with the edge cases.
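
A minimal sketch of that approach, written in Python rather than ANTLR. The names `tokenize` and `classify_hashes` and the token kinds are made up for illustration, and it assumes that a '#' inside curly braces is always a color and a '#' outside them is always an ID selector:

```python
import re

# Hypothetical token pattern: each special character is its own token,
# and any run of letters, digits, or hyphens is a NAME token.
TOKEN_RE = re.compile(r"\s*(?:(?P<punct>[#.:;{}])|(?P<name>[A-Za-z0-9-]+))")


def tokenize(css):
    """Split the input into (kind, text) pairs without deciding what '#' means."""
    css = css.strip()
    tokens = []
    pos = 0
    while pos < len(css):
        match = TOKEN_RE.match(css, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {css[pos]!r}")
        if match.group("punct"):
            tokens.append(("PUNCT", match.group("punct")))
        else:
            tokens.append(("NAME", match.group("name")))
        pos = match.end()
    return tokens


def classify_hashes(tokens):
    """Post-process the token list: label each '#' + NAME pair by its position."""
    results = []
    depth = 0  # number of currently open curly braces
    for i, (kind, text) in enumerate(tokens):
        if kind != "PUNCT":
            continue
        if text == "{":
            depth += 1
        elif text == "}":
            depth -= 1
        elif text == "#" and i + 1 < len(tokens) and tokens[i + 1][0] == "NAME":
            role = "color" if depth > 0 else "id_selector"
            results.append((role, "#" + tokens[i + 1][1]))
    return results


if __name__ == "__main__":
    sample = "#bbb {\n  color: #fff;\n}"
    print(classify_hashes(tokenize(sample)))
```

The lexing step never has to choose between COLOR and a selector; that decision is made afterwards, where the surrounding context is available. Whitespace is simply skipped here, which real CSS would not allow everywhere.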

molf
Erik van Brakel

4 Answers


Try moving the # in your lexer file from COLOR to its own thing, as such:

LLETTERS: ( 'a'..'z' );
ULETTERS: ( 'A'..'Z' );
NUMBERS: ( '0'..'9' );
HASH : '#';

Then, in your parser rules, you can do it like this:

color: HASH (LLETTERS | NUMBERS)+;
selector: HASH (ULETTERS | LLETTERS) (ULETTERS | LLETTERS | NUMBERS)*;

etc.

This allows you to specify the difference grammatically, which can roughly be described as contextually, versus lexically, which can roughly be described as by appearance. If something's meaning changes depending on where it is, that difference should be specified in the grammar, not the lexer.

Note that color and selector have almost the same definition. The lexer is typically a separate stage that runs before the parser applies the grammar, so the token definitions cannot be ambiguous (as was pointed out, bbb could be a hex string or a lowercase letter string). Data validity checking therefore needs to be done elsewhere.

Walt W
  • This still doesn't work. the problem is that bbb (or anything that starts with 0..9 | a..f) will be tokenized as HEXSTRING. This will prevent #bbb to be matched as a selector. – Erik van Brakel Aug 24 '09 at 23:41
  • well, actually I was backwards there. I believe that since bbb is both a valid string AND a valid hexstring, you will need to do software-side data validity checking. – Walt W Aug 24 '09 at 23:46
  • That's what I'm afraid of. Hopefully there's an antlr guru running around here on stackoverflow who can prove you wrong :/ – Erik van Brakel Aug 24 '09 at 23:49
  • Yeah there might be a better way. But that should work . . sorry, I apparently haven't written a parser in awhile :-[ – Walt W Aug 24 '09 at 23:52
  • It seems that your example isn't complete like it is. – CSchulz Apr 21 '15 at 08:33

To ditto what Walt said, Appendix G. Grammar of CSS 2.1 says to lex HASH, and then (depending on its position relative to other tokens) to parse a HASH either as a simple_selector or as a hexcolor.

The lexer defines the following token ...

"#"{name}       {return HASH;}

... and the grammar includes the following rules ...

hexcolor
  : HASH S*
  ;

simple_selector
  : element_name [ HASH | class | attrib | pseudo ]*
  | [ HASH | class | attrib | pseudo ]+
  ;

This means that a parser based on the grammar would allow a non-hex hexcolor.

I'd detect a non-hex hexcolor later, in code which analyzes/interprets the lexed+parsed syntax tree.
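
As a rough sketch of that later check, in Python: the function name `is_hex_color` is hypothetical, and the input is assumed to be the text of a HASH token, including the leading '#', that the parser has already placed in a hexcolor position. CSS 2.1 hex colors have either three or six hexadecimal digits.

```python
import re

# Three or six hexadecimal digits after the '#', as in #rgb or #rrggbb.
HEX_COLOR_RE = re.compile(r"#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})")


def is_hex_color(hash_text):
    """Return True if the text of a HASH token is a valid CSS 2.1 hex color."""
    return HEX_COLOR_RE.fullmatch(hash_text) is not None


if __name__ == "__main__":
    for text in ["#fff", "#bbb", "#header", "#ffff"]:
        print(text, is_hex_color(text))
```

A check like this would run over the syntax tree after parsing, so the grammar itself can stay as loose as the spec defines it.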

ChrisW
  • Yes, I am familiar with that appendix. It's what I use as one of my sources for the grammar I'm building. Doesn't solve the problem for me though :( – Erik van Brakel Aug 24 '09 at 23:42
  • @Erik: Have you taken a look at the CSS grammar available at http://www.antlr.org/grammar/list – Vineet Reynolds Aug 24 '09 at 23:48
  • Yes, I've taken a look at the CSS 3 grammar, it shows the same error. – Erik van Brakel Aug 24 '09 at 23:59
  • "Doesn't solve the problem for me though" -- What problem? If you implement the grammar as specified in the spec, it works. Perhaps your problem is that you're trying to rewrite the specified grammar to make it stricter than it is in the spec, perhaps (I don't know why) to move error-checking into the parser. – ChrisW Aug 25 '09 at 00:05
  • Maybe I misinterpreted it, or I'm doing something wrong with ANTLR. I'll look into it again, and get back to this question at that time. Probably tomorrow or the day after. – Erik van Brakel Aug 25 '09 at 00:06
  • Rather than ANTLR (which I didn't try) I made the specified grammar machine-readable using the GOLD Parser, but anyway that worked. – ChrisW Aug 25 '09 at 00:10

To make a decision from multiple alternatives, ANTLR has two options,

  • syntactic predicates
  • semantic predicates

This is from antlr grammar lib (css2.1 g):

simpleSelector
    : elementName 
        ((esPred)=>elementSubsequent)*

    | ((esPred)=>elementSubsequent)+
    ;

esPred
    : HASH | DOT | LBRACKET | COLON
    ;

elementSubsequent
    : HASH
    | cssClass
    | attrib
    | pseudo
    ;

cssClass
    : DOT IDENT
    ;

elementName
    : IDENT
    | STAR
    ;

The (esPred)=> prefixes in this grammar are syntactic predicates.

Link to grammar: http://www.antlr.org/grammar/1240941192304/css21.g

ЯegDwight

I came here by googling and found a good resource, a real implementation. For anyone searching for a complete CSS ANTLR grammar, take a look at this grammar file. It can give you an idea, or you can use it directly.

diyoda_