ANTLR get and split lexer content

Question

first, sorry about my english, i still learning.

I writing Python module for my framework, which parsing CSS files. I try regex, ply (python lexer and parser), but i found myself in ANTLR.

First try, i need to parse comments from CSS file. This is my CSS string to parse:

/*test*/

/*
test1
/*

/*test2/*nested*/comment/*

I know that CSS doesn't allow nested comments, but i need it in my framework. I wrote simple ANTLR grammar:

grammar CSS;

options {
    language = Python;
}

styleSheet
    : comments EOF ;

comments
    : NESTED_ML_COMMENT*
    ;

NESTED_ML_COMMENT
    :   '/*' 
        (options {greedy=false;} : (NESTED_ML_COMMENT | . ) )* 
        '*/' 
    ;

LINEBREAK 
    :  ('\n\r' | '\n')+{$channel=HIDDEN; };

What i get in result is:

enter image description here

What i expect (paint work :D):

enter image description here

Notice that i don't want /* and */ in result.

Is there any way to do this in pure ANTLR? I have no problem with using python in ANTLR, but if there any way to do this without python i will be grateful.

score 2 · Accepted Answer · edited May 23 '17 at 12:01

No, there is no easy way. Since NESTED_ML_COMMENT is a lexer rule (a "simple" token), you cannot let a parser rule create any more structure in source like /*test2/*nested*/comment*/: lexer rules will always stay a "flat" sequence of characters. Sure, there are (easy) ways to rewrite this character sequence (ie. remove /* and */), but creating parent-sibling hierarchies, no.

In order to create a hierarchy like you displayed in your 2^nd image, you will have to "promote" your comment-rule to the parser (so make it into a parser rule). In that case, your lexer would have a COMMENT_START : '/*'; and COMMENT_END : '*/'; rule. But that opens a can of worms: inside your lexer you would now also need to account for all characters that can come between /* and */.

You could create another parser that parses (nested) comments and use that inside your CSS grammar. Inside your CSS grammar, you simply keep it as it is, and your second parser is a dedicated comments-parser that creates a hierarchy from the comment-tokens.

A quick demo. The grammar:

grammar T;

parse
  :  comment EOF 
  ;

comment
  :  COMMENT_START (ANY | comment)* COMMENT_END
  ;

COMMENT_START : '/*';
COMMENT_END   : '*/';
ANY           :  . ;

will parse the source /*test2/*nested*/comment*/ into the following parse tree:

enter image description here

which you can rewrite so that /* and */ are removed, of course.

Inside your CSS grammar, you then do:

comment
  :  NESTED_ML_COMMENT 
     {
       text = $NESTED_ML_COMMENT.text
       # invoke the TParser (my demo grammar) on `text`
     }
  ;

EDIT

Note that ANTLRWorks creates it's own internal parse tree to which you have no access. If you do not tell ANTLR to generate a proper AST, you will just end up with a flat list of tokens (even though ANTLRWorks suggests it is some sort of tree).

Here's a previous Q&A that explains how to create a proper AST: How to output the AST built using ANTLR?

Now let's get back to the "comment" grammar I posted above. I'll rename the ANY rule to TEXT. At the moment, this rule only matches a single character at a time. But it's more convenient to let it match all the way up to the next /* or */. This can be done by introducing a plain Python method in the lexer class that performs this check. Inside the TEXT rule, we'll use that method inside a predicate so that * gets matched if it's not directly followed by a /, and a / gets matched if it's not directly followed by a *:

grammar Comment;

options {
  output=AST;
  language=Python;
}

tokens {
  COMMENT;
}

@lexer::members {
  def not_part_of_comment(self):
    current = self.input.LA(1)
    next = self.input.LA(2)
    if current == ord('*'): return next != ord('/')
    if current == ord('/'): return next != ord('*')  
    return True
}

parse
  :  comment EOF -> comment
  ;

comment
  :  COMMENT_START atom* COMMENT_END -> ^(COMMENT atom*)
  ;

atom
  :  TEXT
  |  comment
  ;

COMMENT_START : '/*';
COMMENT_END   : '*/';
TEXT          : ({self.not_part_of_comment()}?=> . )+ ;

Find out more about the predicate syntax, { boolean_expression }?=>, in this Q&A: What is a 'semantic predicate' in ANTLR?

To test this all, make sure you have the proper Python runtime libraries installed (see the ANTLR Wiki). And be sure to use ANTLR version 3.1.3 with this runtime.

Generate the lexer- and parser like this:

java -cp antlr-3.1.3.jar org.antlr.Tool Comment.g

and test the lexer and parser with the following Python script:

#!/usr/bin/env python

import antlr3
from antlr3 import *
from antlr3.tree import *
from CommentLexer import *
from CommentParser import *

# http://www.antlr.org/wiki/display/ANTLR3/Python+runtime
# http://www.antlr.org/download/antlr-3.1.3.jar

def print_level_order(tree, indent):
  print '{0}{1}'.format('   '*indent, tree.text)
  for child in tree.getChildren():
    print_level_order(child, indent+1)

input = '/*aaa1/*bbb/*ccc*/*/aaa2*/'
char_stream = antlr3.ANTLRStringStream(input)
lexer = CommentLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = CommentParser(tokens)
tree = parser.parse().tree 
print_level_order(tree, 0)

As you can see, from the source "/*aaa1/*bbb/*ccc*/*/aaa2*/", the following AST is created:

COMMENT
   aaa1
   COMMENT
      bbb
      COMMENT
         ccc
   aaa2

EDIT II

I mind as well show how you can invoke the Comment parser from your CSS grammar. Here's a quick demo:

grammar CSS;

options {
  output=AST;
  language=Python;
}

tokens {
  CSS_FILE;
  RULE;
  BLOCK;
  DECLARATION;
}

@parser::header {
import antlr3
from antlr3 import *
from antlr3.tree import *
from CommentLexer import *
from CommentParser import *
}

@parser::members {
  def parse_comment(self, text):
    lexer = CommentLexer(antlr3.ANTLRStringStream(text))
    parser = CommentParser(antlr3.CommonTokenStream(lexer))
    return parser.parse().tree 
}

parse
  :  atom+ EOF -> ^(CSS_FILE atom+)
  ;

atom
  :  rule
  |  Comment -> {self.parse_comment($Comment.text)}
  ;

rule
  :  Identifier declarationBlock -> ^(RULE Identifier declarationBlock)
  ;

declarationBlock
  :  '{' declaration+ '}' -> ^(BLOCK declaration+)
  ;

declaration
  :  a=Identifier ':' b=Identifier ';' -> ^(DECLARATION $a $b)
  ;

Identifier
  :  ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')*
  ;

Comment
  :  '/*' (options {greedy=false;} : Comment | . )* '*/'
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
  ;

If you parse the source:

h1 {  a: b;  c: d;}

/*aaa1/*bbb/*ccc*/*/aaa2*/

p {x  :  y;}

with the CSSParser, you'll get the following tree:

CSS_FILE
   RULE
      h1
      BLOCK
         DECLARATION
            a
            b
         DECLARATION
            c
            d
   COMMENT
      aaa1
      COMMENT
         bbb
         COMMENT
            ccc
      aaa2
   RULE
      p
      BLOCK
         DECLARATION
            x
            y

as you can see by running this test script:

#!/usr/bin/env python

import antlr3
from antlr3 import *
from antlr3.tree import *
from CSSLexer import *
from CSSParser import *

def print_level_order(tree, indent):
  print '{0}{1}'.format('   '*indent, tree.text)
  for child in tree.getChildren():
    print_level_order(child, indent+1)

input = 'h1 {  a: b;  c: d;}\n\n/*aaa1/*bbb/*ccc*/*/aaa2*/\n\np {x  :  y;}'
char_stream = antlr3.ANTLRStringStream(input)
lexer = CSSLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = CSSParser(tokens)
tree = parser.parse().tree 
print_level_order(tree, 0)

@Bart, you read in my mind. I twice preparing to answer you and give another question, but then refresh page and i see you already answered it! :) Give me a minute to read you'r answer, understand it and check if it works. — Galmi, May 08 '11 at 21:29
You might need a bit more than a few minutes to properly implement this :). I'm logging off right now, but feel free to ask questions if you have them: I'll probably have a look tomorrow. — Bart Kiers, May 08 '11 at 21:32
@Bart: I tested your grammar. I notice that are is a few bugs. First, and only important is that when i parse string with nested comments and try to take tree in python. Unfortunately, tree is flat array, no subtrees, only flat char array. In ANTLRWorks seems to work well, according to generated image and tree view. I realize that may be only Python runtime error, but this is only language i know. Can you test your grammar with another runtime? — Galmi, May 09 '11 at 02:58
@Galmi, no bugs: just a misconception of how ANTLR works (luckily :)). See my **EDIT**. — Bart Kiers, May 09 '11 at 08:47
@Bart, i get it. Perfectly clear written, so even i can understood it at first time :) This is exactly what i want and even more. Thank you very much for you time, your code is perfectly clear for me and helped me a lot to understand how ANTLR works and how powerful it is. — Galmi, May 09 '11 at 14:08

score 0 · Answer 2 · edited May 23 '17 at 12:01

0

You should use the ! and ^ AST hints. To make /* not appear in your AST, put ! after it. To control which elements become roots of AST subtrees, append ^. It might look something like this:

NESTED_ML_COMMENT
:   COMMENT_START!
    (options {greedy=false;} : (NESTED_ML_COMMENT^ | . ) )* 
    COMMENT_END!
;

Here's a question specifically about these operators, which now that you know exist, I hope will be useful: What does ^ and ! stand for in ANTLR grammar

edited May 23 '17 at 12:01

Community

1
1

answered May 08 '11 at 19:44

John Zwinck

239,568
38
324
436

No, inside lexer rules, `^` and `!` have no special meaning. These are only used in parser rules. – Bart Kiers May 08 '11 at 20:41
OK, you're right. Nonetheless, these hints will probably help fix the problem, even if my specific example was not quite right. – John Zwinck May 08 '11 at 21:00
First, @John, thanks for you'r help. For now i even don't know about AST hints :) But like @Bart mentioned, AST hints don't work in lexer rules (i figured it out after hour messing with this code). Anyway, thanks for this link, it very helpful for learning ANTLR. – Galmi May 08 '11 at 21:25

ANTLR get and split lexer content

2 Answers2

EDIT

EDIT II

Linked