25

When parsing a freeform language like C, it is easy for the parser to determine when several expressions are related to one another simply by looking at the symbols emitted by the parser. For example, in the code

if (x == 5) {
    a = b;
    c = d;
}

The parser can tell that a = b; and c = d; are part of the same block statement because they're surrounded by braces. This could easily be encoded as a CFG using something like this:

STMT        ::=  IF_STMT | EXPR; | BLOCK_STMT | STMT STMT
IF_STMT     ::=  if ( EXPR ) STMT
BLOCK_STMT  ::=  { STMT }

In Python and other whitespace-sensitive languages, though, it's not as easy to do this because the structure of the statements can only be inferred from their absolute position, which I don't think can easily be encoded into a CFG. For example, the above code in Python would look like this:

if x == 5:
    a = b
    c = d

Try as I might, I can't see a way to write a CFG that would accept this, because I can't figure out how to encode "two statements at the same level of nesting" into a CFG.

How do Python parsers group statements as they do? Do they rely on a scanner that automatically inserts extra tokens denoting starts and ends of statements? Do they produce a rough AST for the program, then have an extra pass that assembles statements based on their indentation? Is there a clever CFG for this problem that I'm missing? Or do they use a more powerful parser than a standard LL(1) or LALR(1) parser that's able to take whitespace level into account?

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065

1 Answers1

25

The indentations are handled with two "pseudo tokens" - INDENT and DEDENT. There are some details here. For more information, you should look at the source for the python tokeniser and parser.

Noufal Ibrahim
  • 71,383
  • 13
  • 135
  • 169
  • I had no idea that the Python docs were so detailed! Thanks so much for this answer, and especially for providing it so quickly! – templatetypedef Jun 21 '11 at 18:57
  • 1
    Python has some of the best documentation I've seen for an open source project. That culture has spread to python projects in general. As for the quickness, this is something that I was curious about as as well when I first used the language so I looked it up. :) – Noufal Ibrahim Jun 21 '11 at 18:58
  • 1
    Oh yes. One more thing. Python uses a simple LL parser to parse source. This keep the grammar simple and prevents people from getting ideas. – Noufal Ibrahim Jun 21 '11 at 19:15
  • Our non-Python based parsers of Python produce INDENT/DEDENT tokens as well. We use a different parser than LL. – Ira Baxter Jun 22 '11 at 02:13
  • 1
    Ira: That's fine. I'm just saying that the "standard" (i.e. CPython) parser is LL and that keeps the "official" grammar uncomplicated. You can do more damage with an LR or earley parser than you do with an LL since the grammars you can write can be more complex. – Noufal Ibrahim Jun 22 '11 at 06:04
  • @Noufal: I really like this fact about Python. It's nice to feel like you have the option of parsing your code and programmatically changing it. – Neil G Jun 23 '11 at 05:07
  • Neil: You should check out the [ast](http://docs.python.org/library/ast.html) module then. Also, while it's "nice", it's not necessarily a good thing. – Noufal Ibrahim Jun 23 '11 at 05:46