2

I'm trying to create a Syntax Analyzer in Java (using CUP) that could recognise this piece of code:

if ¿b? then
~ a = 2;
~ if ¿b && c? then
~ ~ a = 3;
else
~ a = 4;

My productions used by the "if" statement are these that follow:

Instr ::= ...
       | IF CONOP Exp:e CONCL THEN CondInstrList:l
       ...
       ;
...
CondInstrList ::= CondInstrList CondInstr
       | /*empty*/
       ;
...
CondInstr ::= CONTROLD Instr
       | CONTROLD CondInstr
       ;

where Instr stands for instruction/statement, CondInstrList stands for Conditional Instruction List and CONTROLD stands for Control Dash (~). (CONOP and CONCL mean Condition Open/Close)

The problem is that with that grammar, the generated AST is as follows:

if
|-condition b
|-condInstrListT
  |---asig a = 2
  |---if
      |---condition b and c
      |---condInstrListT 
      |   |---asig a = 2
      |---condInstrListF
          |---asig a = 4

and so, the "else" part is associated with the inner "if".

I just don't know how to write a grammar that respects the way I want my language to be.

Any help is appreciated.

I can give more detail if needed.

1 Answers1

1

I don't think you can do what you have in mind through the grammar alone. But it is possible with a slightly different grammar and some help from your lexical analyzer.

Here's what to do: rather than treat the ~ marks as individual grammar symbols, have the lexical analyzer turn sequences of ~ at the start of a line into INDENT and OUTDENT tokens that will work in your grammar the same way that { and } work in Java. You keep track of a "current indent level", which starts at zero. At the start of each line, count the ~ characters. For each ~ in excess of the current indent level, generate an INDENT token and increase the current indent level; for each ~ less than the current indent level, generate an OUTDENT token and decrease the current indent level.

So your example text of

if ¿b? then
~ a = 2;
~ if ¿b && c? then
~ ~ a = 3;
else
~ a = 4;

would be tokenized as:

// Indent level = 0 and no ~, so no INDENT here
[IF] [CONOP] [ID b] [CONCL] [THEN]
// Indent level = 0, one ~, so one INDENT
[INDENT]
    // Indent level = 1
    [ID a] [OP =] [CONST 2] [SEMICOLON]
    // Indent level = 1, one ~, so no INDENT here
    [IF] [CONOP] [ID b] [OP &&] [ID c] [CONCL] [THEN]
    // Indent level = 1, two ~, so one INDENT
    [INDENT]
        // Indent level = 2
        [ID a] [ASSIGN] [CONST 3] [SEMICOLON]
        // Indent level = 2, lines starts with no ~, two OUTDENTs
    [OUTDENT]
    // Indent level = 1
[OUTDENT]
//Indent level = 0
[ELSE] // No ~ at start of this line, so no INDENT
// Indent level = 0; one ~, so one INDENT
[INDENT] 
    // Indent level = 1
    [ID a] [ASSIGN] [CONST 4] [SEMICOLON]
// End-of-input.  Indent level = 1, so 1 OUTDENT
[OUTDENT]
// Done; indent level = 0;

The INDENT and OUTDENT tokens would act in your grammar like left- and right-braces do in Java, so your grammar could look something like:

Instr ::= ...
       | IF CONOP Exp:e CONCL THEN INDENT CondInstrList:l OUTDENT
       ...
       ;
...
CondInstrList ::= CondInstrList Instr
       | /*empty*/
       ;
...

The language Python does the same thing but with just white space instead of ~. You can download Python source here if you're interested. Look for the files Grammar\Grammar and Parser\tokenizer.c.

Kevin Anderson
  • 4,568
  • 3
  • 13
  • 21
  • That's exactly the solution. Now I'm having a problem with JLex. I store the indentation level in an attribute called `indent` and each end of line I reset the counter of the actual indentation level (attribute `actual`) to 0. This way, when there's one dash more than the actual `indent` level I return an INDENT symbol. The problem raises when I have to return the OUTDENT. I just can't figure out how to do it as JLex only returns one value. I can find out how many "indentation levels" I have to substract but I have no way to return as many OUTDENT symbols as required. – Javier Sagredo May 17 '17 at 15:14
  • I think you'd need to be able to re-read input in order to generate the correct OUTDENT tokens directly in the lexical analyzer, and I'm not seen any evidence that JLex can do that. You'll probably need to have an intermediate layer between the parser and the lexical analyzer which could receive information about the indent level of each line from the lexical analyzer and figure out the right number of INDENT or OUTDENT tokens to return to the scanner. – Kevin Anderson May 18 '17 at 02:38
  • That last sentence should read "...tokens to return to the **parser**." – Kevin Anderson May 18 '17 at 23:26