You can parse an indentation-sensitive language (like Python or Haskell) by using a little hack, which is well-described in the Python language reference's chapter on lexical analysis. As described, the lexical analyzer turns leading whitespace into INDENT
and DEDENT
tokens [Note 1], which are then used in the Python grammar in a straightforward fashion. Here's a small excerpt:
suite ::= stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT
statement ::= stmt_list NEWLINE | compound_stmt
stmt_list ::= simple_stmt (";" simple_stmt)* [";"]
while_stmt ::= "while" expression ":" suite ["else" ":" suite]
So if you are prepared to describe (or reference) the lexical analysis algorithm, the BNF is simple.
However, you cannot actually write that algorithm as a context free grammar, because it is not context-free. (I'll leave out the proof, but it's similar to the proof that anbncn
is not context free, which you can find in most elementary formal language textbooks, and all over the internet.)
ISO standard EBNF (a free PDF is available) provides a way of including "extensions which a user may require": a Special-sequence
, which is any text not containing a ? surrounded on both sides by a ?. So you could abuse the notation by including [Note 2]:
DEDENT = ? See section 2.1.8 of https://docs.python.org/3.3/reference/ ? ;
Or you could insert a full description of the algorithm. Of course, neither of those techniques will allow a parser generator to produce an accurate lexical analyzer, but it would be a reasonable way of communicating intent to a human reader.
It's worth noting that EBNF itself uses a special sequence to define one of its productions:
(* see 4.7 *) syntactic exception
= ? a syntactic-factor that could be replaced
by a syntactic-factor containing no
meta-identifiers
? ;
Notes
The lexical analyzer also converts some physical newline characters into NEWLINE
tokens, while making other newline characters vanish.
EBNF normally uses the syntax =
rather than ::=
for a production, and insists that they be terminated with ;
. Comments are enclosed between (*
and *)
.