6

How can I get the effect of Perl's regular expression anchors ^ and $ in the ANTLR4 lexer, i.e., match the start of a line and the end of a line without consuming any characters?

I am trying to use the ANTLR4 lexer to match a # character at the start of a line but not in the middle of a line. For example, I want to isolate and toss out all C++ preprocessor directives, regardless of which directive it is, while disregarding a # inside a string literal. (Normally we could tokenize C++ string literals to rule out a # appearing in the middle of a line, but assume we're not doing that.) That means I only want to specify # .*? without bothering with #if, #ifndef, #pragma, etc.

Also, the C++ standard allows whitespace and multi-line comments right before and after the #, e.g.

   /* helo
world*/  #  /* hel
l
o
*/  /*world */ifdef .....

is considered a valid preprocessor directive appearing on a single line (the CRLFs inside the ML_COMMENTs are tossed).

This is what I am doing currently:

PPLINE: '\r'? '\n' (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+ -> channel(PPDIR); 

But the problem is that I have to rely on the existence of a CRLF before the # and toss out that CRLF together with the directive. I then need the CRLF that ends the directive line to stand in for the one tossed out, so I have to make sure the directive is terminated by a CRLF.

However, that means my grammar cannot handle a directive appearing right at the start of the file (i.e., with no preceding CRLF) or one followed by EOF without a terminating CRLF.

If the Perl-style regex ^ and $ syntax were available, I could match the SOL/EOL instead of explicitly matching and consuming a CRLF.

JavaMan

2 Answers

5

You can use semantic predicates for the conditions.

PPLINE
    :   {getCharPositionInLine() == 0}?
        (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+
        {_input.LA(1) == '\r' || _input.LA(1) == '\n'}?
        -> channel(PPDIR)
    ;
Sam Harwell
  • In Terence Parr's book, semantic predicates are said to appear on the right edge of lexer rules. How should we interpret your example, in which a semantic predicate appears on the left edge? – JavaMan May 06 '13 at 08:03
  • In ANTLR 4, semantic predicates can appear anywhere in a *lexer* rule, and they'll be evaluated at the point where they appear. Parser rules are a bit more restrictive - predicates can only appear on the left edge of a decision. – Sam Harwell May 06 '13 at 13:06
  • `NameError: name 'getCharPositionInLine' is not defined` – nmz787 May 20 '20 at 23:07
  • Same problem here: `ReferenceError: getCharPositionInLine is not defined`. Does this not exist in JavaScript? – rednoyz Jan 19 '21 at 17:23
  • Actually, it seems you need to use 'this.column' instead (the documentation is not great). – rednoyz Jan 19 '21 at 18:39
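
To make the two predicates in the accepted rule concrete, here is a minimal hand-written Python sketch of the same scan. It is not ANTLR-generated code, and the helper names (`skip_ml_comment`, `match_directive`, `find_directives`) are hypothetical. It assumes ML_COMMENT is a non-nested /* ... */ comment, and it treats end of input as a valid line end, which the trailing predicate as written (checking only '\r' and '\n') does not appear to do.

```python
def skip_ml_comment(text, i):
    """Return the index just past a /* ... */ comment starting at i, or None."""
    if text.startswith("/*", i):
        close = text.find("*/", i + 2)
        if close != -1:
            return close + 2
    return None


def match_directive(text, start):
    """Try to match a directive at offset start (caller guarantees column 0).

    Returns the exclusive end offset, or None if no directive starts here.
    """
    i, n = start, len(text)
    # (ML_COMMENT | '\t' | '\f' | ' ')*
    while i < n:
        after = skip_ml_comment(text, i)
        if after is not None:
            i = after
        elif text[i] in "\t\f ":
            i += 1
        else:
            break
    if i >= n or text[i] != "#":
        return None
    i += 1
    body_start = i
    # (ML_COMMENT | ~[\r\n])+
    while i < n:
        after = skip_ml_comment(text, i)
        if after is not None:
            i = after
        elif text[i] not in "\r\n":
            i += 1
        else:
            break
    if i == body_start:
        return None  # the '+' requires at least one element after '#'
    # The loop stops only at '\r', '\n', or end of input (assumed extension),
    # so the line terminator is looked at but never consumed.
    return i


def find_directives(text):
    """Return the text of every directive that starts at column 0."""
    found = []
    i, n = 0, len(text)
    at_line_start = True  # stands in for getCharPositionInLine() == 0
    while i < n:
        if at_line_start:
            end = match_directive(text, i)
            if end is not None:
                found.append(text[i:end])
                i = end
                at_line_start = False
                continue
        after = skip_ml_comment(text, i)
        if after is not None:
            # An ordinary comment is skipped whole, so a '#' at the start
            # of a line inside it is never mistaken for a directive.
            i = after
            at_line_start = False
            continue
        at_line_start = text[i] == "\n"
        i += 1
    return found
```

Because the start-of-line check is a flag rather than a consumed character, a directive on the first line of the input and one ending at end of input are both found, and every CRLF stays in the stream for the next token.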
1

You could try having multiple rules with gated semantics (Different lexer rules in different state) or with modes (pushMode -> http://www.antlr.org/wiki/display/ANTLR4/Lexer+Rules): add an alternative rule for the beginning of the file and then switch to the core rules when the directives end. It could be a long job, though.

First, though, I would check whether there really is a problem parsing #pragma and other preprocessor directives without changing anything. If the concern is that a # might appear inside strings and comments, then simply ordering the rules should direct it to the right case (though this could be a problem for languages where directives can appear inside comments).
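
As a rough illustration of that second suggestion, the Python sketch below tokenizes strings and comments before looking for a bare #, so a # inside them is swallowed by the enclosing token. This is only an analogy: Python's regex alternation is ordered, whereas ANTLR picks the longest match and uses rule order to break ties. The group names and the `hash_offsets` helper are hypothetical, and character literals and raw strings are not handled.

```python
import re

# Alternatives are tried in order at each position; strings and comments
# come first, so any '#' inside them is consumed as part of that token.
TOKEN = re.compile(r'''
    (?P<STRING>"(?:\\.|[^"\\\r\n])*")
  | (?P<ML_COMMENT>/\*.*?\*/)
  | (?P<LINE_COMMENT>//[^\r\n]*)
  | (?P<HASH>\#)
  | (?P<OTHER>.)
''', re.VERBOSE | re.DOTALL)


def hash_offsets(text):
    """Offsets of '#' characters that lie outside strings and comments."""
    return [m.start() for m in TOKEN.finditer(text) if m.lastgroup == "HASH"]
```

Only the surviving offsets would then need a start-of-line check, for example by confirming that nothing except whitespace and comments precedes the # on its line.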

lunadir