ANTLR4 Lexer Matching Start of Line End Of Line

Question

How can I get the equivalent of Perl's regular expression anchors ^ and $ in the ANTLR4 lexer, i.e. match the start and end of a line without consuming any characters?

I am trying to use the ANTLR4 lexer to match a # character at the start of a line but not in the middle of a line, for example to isolate and discard all C++ preprocessor directives, whichever directive it is, while disregarding a # inside a string literal. (Normally we could tokenize C++ string literals to rule out a # appearing in the middle of a line, but assume we're not doing that.) That means I only want to specify # .*? without enumerating #if, #ifndef, #pragma, etc.

Also, the C++ standard allows whitespace and multi-line comments right before and after the #, e.g.

   /* helo
world*/  #  /* hel
l
o
*/  /*world */ifdef .....

is considered a valid preprocessor directive on a single line (the CRLFs inside the ML_COMMENTs are discarded).

This is what I am doing currently:

PPLINE: '\r'? '\n' (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+ -> channel(PPDIR);

But the problem is that I have to rely on a CRLF existing before the # and discard that CRLF together with the directive. I then need to replace the discarded CRLF with the CRLF that ends the directive line, so I have to make sure the directive is terminated by a CRLF.

However, that means my grammar cannot handle a directive appearing right at the start of the file (i.e. with no preceding CRLF) or one followed directly by EOF without a terminating CRLF.

If the Perl-style regex ^ and $ syntax were available, I could match the SOL/EOL instead of explicitly matching and consuming CRLF.

score 5 · Accepted Answer · answered May 05 '13 at 17:37

You can use semantic predicates for the conditions.

PPLINE
    :   {getCharPositionInLine() == 0}?
        (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+
        {_input.LA(1) == '\r' || _input.LA(1) == '\n'}?
        -> channel(PPDIR)
    ;
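To make the two predicates concrete, here is a minimal plain-Java sketch of the same idea outside ANTLR: a column counter stands in for getCharPositionInLine(), and the end-of-line check only peeks at the next character without consuming it. The class and method names (LineAnchorSketch, findDirectives, matchDirective, skipBlanksAndComments) are hypothetical, and treating end of input as a line end is an assumption added here because the question asks for it; the predicate in the rule above only tests for '\r' and '\n'.

```java
import java.util.ArrayList;
import java.util.List;

final class LineAnchorSketch {

    /** Collects the text of every directive-like line in src. */
    static List<String> findDirectives(String src) {
        List<String> found = new ArrayList<>();
        int i = 0;
        int column = 0; // stands in for getCharPositionInLine()
        while (i < src.length()) {
            // "^": only try the directive rule when we are at column 0.
            int next = (column == 0) ? matchDirective(src, i) : -1;
            if (next > i) {
                found.add(src.substring(i, next));
            } else if (src.startsWith("/*", i) && src.indexOf("*/", i + 2) >= 0) {
                next = src.indexOf("*/", i + 2) + 2; // ordinary comment, one unit
            } else {
                next = i + 1;                        // any other character
            }
            // Advance and keep the column in sync, resetting after each '\n'.
            for (; i < next; i++) {
                column = (src.charAt(i) == '\n') ? 0 : column + 1;
            }
        }
        return found;
    }

    /** Returns the end index of a directive starting at start, or -1. */
    static int matchDirective(String s, int start) {
        int i = skipBlanksAndComments(s, start);
        if (i >= s.length() || s.charAt(i) != '#') return -1;
        int bodyStart = ++i;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\r' || c == '\n') break; // "$": peek only, do not consume
            int close = s.startsWith("/*", i) ? s.indexOf("*/", i + 2) : -1;
            i = (close >= 0) ? close + 2 : i + 1;
        }
        // The rule requires at least one character after '#'.
        // Assumption: reaching end of input also counts as end of line.
        return (i > bodyStart) ? i : -1;
    }

    /** Skips spaces, tabs, form feeds and complete multi-line comments. */
    static int skipBlanksAndComments(String s, int i) {
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == ' ' || c == '\t' || c == '\f') {
                i++;
            } else if (s.startsWith("/*", i) && s.indexOf("*/", i + 2) >= 0) {
                i = s.indexOf("*/", i + 2) + 2;
            } else {
                break;
            }
        }
        return i;
    }

    public static void main(String[] args) {
        String src = "#include <a>\nint x = '#';\n#pragma once";
        findDirectives(src).forEach(System.out::println);
    }
}
```

Because neither check consumes a line terminator, a directive on the first line of the input or on the last line without a trailing newline is handled the same way as any other.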

answered May 05 '13 at 17:37

Sam Harwell


In Terence Parr's book, semantic predicates are said to appear on the right edge of lexer rules. How should we interpret your example, which has semantic predicates on the left edge? – JavaMan May 06 '13 at 08:03
In ANTLR 4, semantic predicates can appear anywhere in a *lexer* rule, and they'll be evaluated at the point where they appear. Parser rules are a bit more restrictive - predicates can only appear on the left edge of a decision. – Sam Harwell May 06 '13 at 13:06
`NameError: name 'getCharPositionInLine' is not defined` – nmz787 May 20 '20 at 23:07
Same problem here: `ReferenceError: getCharPositionInLine is not defined`. Does this not exist in JavaScript? – rednoyz Jan 19 '21 at 17:23
Actually, it seems you need to use 'this.column' instead (the documentation is not great). – rednoyz Jan 19 '21 at 18:39

score 1 · Answer 2 · edited May 23 '17 at 12:10

You could try having multiple rules gated by semantic predicates (different lexer rules in different states) or lexer modes (pushMode, see http://www.antlr.org/wiki/display/ANTLR4/Lexer+Rules): use an alternative rule for the beginning of the file, then switch to the core rules once the directives end. It could be a long job, though.

First, though, I would check whether there really is a problem parsing #pragma and other preprocessor directives without changing anything. If the concern is that a # could appear inside strings and comments, then just ordering the rules should direct it to the right case (although this could be a problem for languages where directives can be placed in comments).
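As a rough illustration of that second point, the plain-Java sketch below consumes quoted literals and comments as whole units before it ever looks for a #, so a # inside them is never reported. It is not ANTLR code, and the names (HashOutsideLiterals, hashPositions, skipQuoted) are hypothetical; it also assumes simple backslash escapes and ignores C++ raw string literals.

```java
import java.util.ArrayList;
import java.util.List;

final class HashOutsideLiterals {

    /** Returns the indices of every '#' that is not inside a literal or comment. */
    static List<Integer> hashPositions(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '"' || c == '\'') {
                i = skipQuoted(s, i);              // string or character literal
            } else if (s.startsWith("//", i)) {
                while (i < s.length() && s.charAt(i) != '\n') i++; // line comment
            } else if (s.startsWith("/*", i)) {
                int close = s.indexOf("*/", i + 2);
                i = (close < 0) ? s.length() : close + 2;          // block comment
            } else {
                if (c == '#') out.add(i);          // only reached outside the above
                i++;
            }
        }
        return out;
    }

    /** Skips a quoted literal starting at start, honoring backslash escapes. */
    static int skipQuoted(String s, int start) {
        char quote = s.charAt(start);
        int i = start + 1;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') { i += 2; continue; }   // skip the escaped character
            i++;
            if (c == quote) break;                 // closing quote consumed
        }
        return Math.min(i, s.length());
    }

    public static void main(String[] args) {
        System.out.println(hashPositions("s = \"a # b\"; // # c\n#if"));
    }
}
```

In an ANTLR lexer the same effect comes from having string and comment rules that match the whole literal, since the longest match wins at each position.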
