
I am trying to define lexer rules for PostgreSQL SQL.

The problem is that the operator definition and line comments conflict with each other.

For example, `@---` should be lexed as the operator token `@-` followed by the `--` comment, and not as the single operator token `@---`.

In Grako it would be possible to define a negative lookahead for the `-` fragment like this:

OP_MINUS: '-' ! ( '-' ) .

In ANTLR4 I could not find any way to roll back an already consumed fragment.

Any ideas?

Here is the original definition of what a PostgreSQL operator name can be:

The operator name is a sequence of up to NAMEDATALEN-1
(63 by default) characters from the following list:

 + - * / < > = ~ ! @ # % ^ & | ` ?

There are a few restrictions on your choice of name:
-- and /* cannot appear anywhere in an operator name,
since they will be taken as the start of a comment.

A multicharacter operator name cannot end in + or -,
unless the name also contains at least one of these
characters:

~ ! @ # % ^ & | ` ?

For example, @- is an allowed operator name, but *- is not.
This restriction allows PostgreSQL to parse SQL-compliant
commands without requiring spaces between tokens.
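To make the quoted restrictions concrete, here is a minimal Python sketch that checks whether a string is an allowed operator name. The function and constant names are hypothetical and chosen only for illustration; the limit of 63 is the default mentioned in the quoted text.

```python
# Characters allowed in an operator name, per the quoted documentation.
OPERATOR_CHARS = set("+-*/<>=~!@#%^&|`?")
# Characters that permit a multicharacter name to end in + or -.
SPECIAL_CHARS = set("~!@#%^&|`?")
MAX_LEN = 63  # NAMEDATALEN-1 with the default setting


def is_valid_operator_name(name: str) -> bool:
    """Return True if `name` satisfies the documented restrictions."""
    if not name or len(name) > MAX_LEN:
        return False
    if any(ch not in OPERATOR_CHARS for ch in name):
        return False
    # "--" and "/*" would be taken as the start of a comment.
    if "--" in name or "/*" in name:
        return False
    # A multicharacter name may end in + or - only if it also
    # contains at least one of the special characters.
    if len(name) > 1 and name[-1] in "+-":
        return any(ch in SPECIAL_CHARS for ch in name)
    return True
```

With this sketch, `is_valid_operator_name("@-")` is accepted and `is_valid_operator_name("*-")` is rejected, matching the example in the quoted text.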
valgog
  • Can you give a more specific example of what you're trying to do, what you already attempted, and why that didn't solve your problem? – Sam Harwell Jun 13 '14 at 04:29
  • So I need the lexer to return an Op class that (for simplification) can contain `+`, `-`, `*` and `/` in any combination. But `--` and `/*` start a comment, and the lexer should be able to return `+--this_is_plus` as two tokens: `Op(+)` and `LineComment(--this_is_plus)`, and not as `Op(+--)` and `Ident(this_is_plus)` – valgog Jun 13 '14 at 07:52
  • Have you tried it? What you are describing that you want is the only way ANTLR works. – Sam Harwell Jun 13 '14 at 10:45
  • The operator token always consists of a `@` followed by exactly 1 character? – Onur Jun 13 '14 at 10:50
  • No, the operator token is 1 to 63 character long. – valgog Jun 13 '14 at 12:40

1 Answer

You can use a semantic predicate in your lexer rules to perform lookahead (or lookbehind) without consuming characters. For example, the following rule covers the restriction that `--` and `/*` cannot appear inside an operator.

OPERATOR
  : ( [+*<>=~!@#%^&|`?]
    | '-' {_input.LA(1) != '-'}?
    | '/' {_input.LA(1) != '*'}?
    )+
  ;
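To see what the predicates do outside of ANTLR, the following Python sketch imitates the rule by hand: it consumes operator characters one at a time and peeks at the next character without consuming it. This is only an illustration of the lookahead idea, not ANTLR's implementation, and `scan_operator` is a hypothetical name.

```python
# Characters the rule above accepts as part of an operator.
OPERATOR_CHARS = set("+-*/<>=~!@#%^&|`?")


def scan_operator(text: str, pos: int = 0) -> str:
    """Return the operator starting at `pos`, or "" if there is none."""
    end = pos
    while end < len(text) and text[end] in OPERATOR_CHARS:
        # Peek at the following character, like _input.LA(1) in the predicate.
        nxt = text[end + 1] if end + 1 < len(text) else ""
        # '-' followed by '-' or '/' followed by '*' starts a comment,
        # so the operator must stop before this character.
        if (text[end] == "-" and nxt == "-") or (text[end] == "/" and nxt == "*"):
            break
        end += 1
    return text[pos:end]
```

For the input `+--this_is_plus` from the comments, the sketch returns `+` and leaves `--this_is_plus` for the line-comment rule.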

However, the above rule does not address the restrictions on including a + or - at the end of an operator. To handle that in the easiest way possible, I would probably separate the two cases into separate rules.

// this rule does not allow + or - at the end of an operator
// (note: a lone + or - is not matched here and needs its own rule)
OPERATOR
  : ( [*<>=~!@#%^&|`?]
    | ( '+'
      | '-' {_input.LA(1) != '-'}?
      )+
      [*<>=~!@#%^&|`?]
    | '/' {_input.LA(1) != '*'}?
    )+
  ;

// this rule allows + or - at the end of an operator and sets the type to OPERATOR
// it requires a character from the special subset to appear
OPERATOR2
  : ( [*<>=+]
    | '-' {_input.LA(1) != '-'}?
    | '/' {_input.LA(1) != '*'}?
    )*
    [~!@#%^&|`?]
    OPERATOR?
    ( '+'
    | '-' {_input.LA(1) != '-'}?
    )+
    -> type(OPERATOR)
  ;
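The trailing `+`/`-` restriction can also be viewed on its own, independent of the grammar. The Python sketch below uses a different technique from the two-rule approach above: it takes an operator candidate that has already been scanned and trims it afterwards. `trim_trailing_signs` is a hypothetical helper, and a real lexer would still have to re-scan the characters that were trimmed off.

```python
# Characters that allow an operator to end in + or -.
SPECIAL_CHARS = set("~!@#%^&|`?")


def trim_trailing_signs(op: str) -> str:
    """Drop trailing + or - from a multicharacter operator candidate
    unless it contains at least one of the special characters."""
    if any(ch in SPECIAL_CHARS for ch in op):
        return op
    end = len(op)
    # Keep at least one character so that a lone + or - survives.
    while end > 1 and op[end - 1] in "+-":
        end -= 1
    return op[:end]
```

Here `*-` is trimmed to `*`, so the `-` becomes a separate token, while `@-` is kept whole because it contains `@`.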
Sam Harwell
  • wow, I'm having a real hard time deciphering this x'] haha.. To put the first rule into words, An operator matches: Either one of : `[*<>=~!@#%^&|'?]`, or ((1 or many : `+` or (`-` without another `-` ahead)) followed by either of : `[*<>=~!@#%^&|'?]`), or a `/` without a `*` ahead of it. Is this correct ? Or did i misunderstand some part of it – Lorenzo Jan 25 '21 at 10:17