When parsing a freeform language like C, it is easy for the parser to determine when several expressions are related to one another simply by looking at the symbols emitted by the parser. For example, in the code
if (x == 5) {
a = b;
c = d;
}
The parser can tell that a = b;
and c = d;
are part of the same block statement because they're surrounded by braces. This could easily be encoded as a CFG using something like this:
STMT ::= IF_STMT | EXPR; | BLOCK_STMT | STMT STMT
IF_STMT ::= if ( EXPR ) STMT
BLOCK_STMT ::= { STMT }
In Python and other whitespace-sensitive languages, though, it's not as easy to do this because the structure of the statements can only be inferred from their absolute position, which I don't think can easily be encoded into a CFG. For example, the above code in Python would look like this:
if x == 5:
a = b
c = d
Try as I might, I can't see a way to write a CFG that would accept this, because I can't figure out how to encode "two statements at the same level of nesting" into a CFG.
How do Python parsers group statements as they do? Do they rely on a scanner that automatically inserts extra tokens denoting starts and ends of statements? Do they produce a rough AST for the program, then have an extra pass that assembles statements based on their indentation? Is there a clever CFG for this problem that I'm missing? Or do they use a more powerful parser than a standard LL(1) or LALR(1) parser that's able to take whitespace level into account?