How to represent vertical alignment of syntax of code using BNF, EBNF or etc.?

Question

How to say that (in BNF, EBNF, etc) any two or more letters are placed in the same vertical alignment

e.g In python 2.x, we have what we call indentation.

def hello():
    print "hello," 
    print "world"

hello()

Note letter p (second line) is placed in the same vertical alignment of letter p (third line)

Further example (in markdown):

MyHeader
========
topic
-----

Note M and the first = are placed in the same vertical alignment (also r and last =, t and first -, c and last -)

My question is How to represent these vertical alignment of letters using BNF, EBNF or etc.?

Further note: My point of this question is searching for a representation method to represent a vertical alignment of code, not just want to know how to write BNF or EBNF of Python or Markdown.

Given that BNF is capable of representing *context-free* grammars, you might have a bit of difficulty using them to represent a grammar that is not, in fact, context-free... `python` is not entirely context free; parts of it are, but the language as a whole is not. — twalberg, Jan 05 '15 at 19:12
Any other ways you can answer me below i just want the solution that can represent the vertical alignment of code. — fronthem, Jan 05 '15 at 19:14

score 10 · Accepted Answer · edited Jun 20 '20 at 09:12

You can parse an indentation-sensitive language (like Python or Haskell) by using a little hack, which is well-described in the Python language reference's chapter on lexical analysis. As described, the lexical analyzer turns leading whitespace into INDENT and DEDENT tokens [Note 1], which are then used in the Python grammar in a straightforward fashion. Here's a small excerpt:

suite         ::=  stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT
statement     ::=  stmt_list NEWLINE | compound_stmt
stmt_list     ::=  simple_stmt (";" simple_stmt)* [";"]
while_stmt    ::=  "while" expression ":" suite ["else" ":" suite]

So if you are prepared to describe (or reference) the lexical analysis algorithm, the BNF is simple.

However, you cannot actually write that algorithm as a context free grammar, because it is not context-free. (I'll leave out the proof, but it's similar to the proof that aⁿbⁿcⁿ is not context free, which you can find in most elementary formal language textbooks, and all over the internet.)

ISO standard EBNF (a free PDF is available) provides a way of including "extensions which a user may require": a Special-sequence, which is any text not containing a ? surrounded on both sides by a ?. So you could abuse the notation by including [Note 2]:

DEDENT = ? See section 2.1.8 of https://docs.python.org/3.3/reference/ ? ;

Or you could insert a full description of the algorithm. Of course, neither of those techniques will allow a parser generator to produce an accurate lexical analyzer, but it would be a reasonable way of communicating intent to a human reader.

It's worth noting that EBNF itself uses a special sequence to define one of its productions:

(* see 4.7 *) syntactic exception
   = ? a syntactic-factor that could be replaced
       by a syntactic-factor containing no
       meta-identifiers
     ? ;

Notes

The lexical analyzer also converts some physical newline characters into NEWLINE tokens, while making other newline characters vanish.
EBNF normally uses the syntax = rather than ::= for a production, and insists that they be terminated with ;. Comments are enclosed between (* and *).

I'm not quite clear... does it mean that the result after pre-processing with lexical-analyzer can be described by context-free grammar? but that what the lexical-analyzer itself does cannot be? — Anentropic, Sep 26 '18 at 17:02
@anentropic: That's what it means. The lexical analyser needs to have a stack of indent positions, and you cannot represent such a stack in a CFG. — rici, Sep 26 '18 at 17:04

How to represent vertical alignment of syntax of code using BNF, EBNF or etc.?

1 Answers1

Notes