
I am consulting the official Python grammar specification as of Python 3.6.

I am unable to find any syntax for comments (they appear prefixed with a #) or for docstrings (they should appear wrapped in '''). A quick look at the lexical analysis page didn't help either: docstrings are defined there as longstrings, but they do not appear in the grammar specification. A token type named STRING appears further down, but its definition is never referenced.

Given this, I am curious about how the CPython compiler knows what comments and docstrings are. How is this feat accomplished?

I originally guessed that comments and docstrings are removed in a first pass by the CPython compiler, but then that raises the question of how help() is able to render the relevant docstrings.

Martijn Pieters
Akshat Mahajan

2 Answers


A docstring is not a separate grammar entity. It is just a regular simple_stmt (following that rule all the way down to atom and STRING+)*. If it is the first statement in a function body, class or module, then it is used as the docstring by the compiler.

This is documented in the reference documentation as footnotes to the class and def compound statements:

[3] A string literal appearing as the first statement in the function body is transformed into the function’s __doc__ attribute and therefore the function’s docstring.

[4] A string literal appearing as the first statement in the class body is transformed into the namespace’s __doc__ item and therefore the class’s docstring.

There is currently no reference documentation that specifies the same for modules; I regard this as a documentation bug.

Comments are removed by the tokenizer and never need to be parsed at all. Their whole point is to have no meaning at the grammar level. See the Comments section of the Lexical Analysis documentation:

A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules are invoked. Comments are ignored by the syntax; they are not tokens.

Bold emphasis mine. So the tokenizer skips comments altogether:

/* Skip comment */
if (c == '#') {
    while (c != EOF && c != '\n') {
        c = tok_nextc(tok);
    }
}

Note that Python source code goes through 3 steps:

  1. Tokenizing
  2. Parsing
  3. Compilation

The grammar only applies to the parsing stage; comments are dropped in the tokenizer, and docstrings are only special to the compiler.
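A quick sketch to see this for yourself: compile two sources that differ only in a comment. The resulting bytecode is identical, because the comment was dropped before the parser or compiler ever saw it:

>>> with_comment = compile('x = 1  # a comment\n', '', 'exec')
>>> without_comment = compile('x = 1\n', '', 'exec')
>>> with_comment.co_code == without_comment.co_code
True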

To illustrate that the parser treats docstrings as nothing more than string literal expressions, you can inspect the result of parsing any Python code as an Abstract Syntax Tree, via the ast module. This produces Python objects that directly reflect the parse tree the Python grammar parser produces, from which Python bytecode is then compiled:

>>> import ast
>>> function = 'def foo():\n    "docstring"\n'
>>> parse_tree = ast.parse(function)
>>> ast.dump(parse_tree)
"Module(body=[FunctionDef(name='foo', args=arguments(args=[], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]), body=[Expr(value=Str(s='docstring'))], decorator_list=[], returns=None)])"
>>> parse_tree.body[0]
<_ast.FunctionDef object at 0x107b96ba8>
>>> parse_tree.body[0].body[0]
<_ast.Expr object at 0x107b16a20>
>>> parse_tree.body[0].body[0].value
<_ast.Str object at 0x107bb3ef0>
>>> parse_tree.body[0].body[0].value.s
'docstring'

So you have a FunctionDef object, which has, as the first element in its body, an expression that is a Str with the value 'docstring'. It is the compiler that then generates a code object, storing that docstring separately as a constant.
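As an aside, the ast module ships a small helper that applies exactly this "first statement in the body is a string expression" convention; continuing the session above:

>>> ast.get_docstring(parse_tree.body[0])
'docstring'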

You can compile the AST into bytecode with the compile() function; again, this is using the actual codepaths the Python interpreter uses. We'll use the dis module to decompile the bytecode for us:

>>> codeobj = compile(parse_tree, '', 'exec')
>>> import dis
>>> dis.dis(codeobj)
  1           0 LOAD_CONST               0 (<code object foo at 0x107ac9d20, file "", line 1>)
              2 LOAD_CONST               1 ('foo')
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (foo)
              8 LOAD_CONST               2 (None)
             10 RETURN_VALUE

So the compiled code produced the top-level statements for a module. The MAKE_FUNCTION opcode uses a stored codeobject (part of the top-level code object constants) to build a function. So we look at that nested code object, at index 0:

>>> dis.dis(codeobj.co_consts[0])
  1           0 LOAD_CONST               1 (None)
              2 RETURN_VALUE

Here the docstring appears to be gone. The function does nothing more than return None. The docstring is instead stored as a constant:

>>> codeobj.co_consts[0].co_consts
('docstring', None)

When executing the MAKE_FUNCTION opcode, it is that first constant, provided it is a string, that is turned into the __doc__ attribute for the function object.

Once compiled, we can execute the code object in a given namespace with the exec() function, which adds a function object with a docstring to that namespace:

>>> namespace = {}
>>> exec(codeobj, namespace)
>>> namespace['foo']
<function foo at 0x107c23e18>
>>> namespace['foo'].__doc__
'docstring'
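
You do not even need MAKE_FUNCTION to see this; building a function object directly from the nested code object (a quick sketch using types.FunctionType) fills in __doc__ from that first constant as well:

>>> import types
>>> types.FunctionType(codeobj.co_consts[0], {}).__doc__
'docstring'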

So it's the job of the compiler to determine when something is a docstring. This is done in C code, in the compiler_isdocstring() function:

static int
compiler_isdocstring(stmt_ty s)
{
    if (s->kind != Expr_kind)
        return 0;
    if (s->v.Expr.value->kind == Str_kind)
        return 1;
    if (s->v.Expr.value->kind == Constant_kind)
        return PyUnicode_CheckExact(s->v.Expr.value->v.Constant.value);
    return 0;
}

This is called from locations where a docstring makes sense; for modules and classes, in compiler_body(), and for functions, in compiler_function().
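The "first statement only" part of that check is easy to observe from Python; a throwaway example:

>>> def not_first():
...     x = 1
...     "not a docstring"
...
>>> print(not_first.__doc__)
None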


TLDR: comments are not part of the grammar, because the grammar parser never even sees comments. They are skipped by the tokenizer. Docstrings are not part of the grammar, because to the grammar parser they are just string literals. It is the compilation step (taking the parse tree output of the parser) that interprets those string expressions as docstrings.


* The full grammar rule path is simple_stmt -> small_stmt -> expr_stmt -> testlist_star_expr -> test -> or_test -> and_test -> not_test -> comparison -> expr -> xor_expr -> and_expr -> shift_expr -> arith_expr -> term -> factor -> power -> atom_expr -> atom -> STRING+

Martijn Pieters
  • Is there a "true" full grammar specification somewhere? What if I want to look up how Python comments look? And is anything else missing from that "Full Grammar specification" page? – Stefan Pochmann Jun 23 '17 at 09:50
  • I tried getting a full picture by googling `python tokenizer` but only found the [`tokenize`](https://docs.python.org/3/library/tokenize.html) module. That must not be it, as its documentation says *"The scanner in this module returns comments as tokens as well, making it useful for implementing “pretty-printers,” including colorizers for on-screen displays."* – Stefan Pochmann Jun 23 '17 at 09:55
  • @StefanPochmann: the `tokenize` module echoes the [C implementation](https://github.com/python/cpython/blob/master/Parser/tokenizer.c) with added niceties like treating comments as tokens *anyway*. – Martijn Pieters Jun 23 '17 at 09:57
  • @StefanPochmann: the full grammar is the [full grammar](https://github.com/python/cpython/blob/master/Grammar/Grammar) used as input to the [`pgen` parser generator](https://github.com/python/cpython/blob/master/Parser/pgen.c). See http://eli.thegreenplace.net/2010/06/30/python-internals-adding-a-new-statement-to-python/ if you want to learn how this all works. – Martijn Pieters Jun 23 '17 at 10:01

Section 1

What happens to comments?

Comments (anything preceded by a #) are ignored during tokenization/lexical analysis, so there is no need to write grammar rules to parse them. They do not provide any semantic information to the interpreter/compiler; they only exist to make the program more readable for its human readers, and so they are discarded.

Here's the lex specification for the ANSI C programming language: http://www.quut.com/c/ANSI-C-grammar-l-1998.html. I'd like to draw your attention to the way comments are being processed here:

"/*"            { comment(); }
"//"[^\n]*      { /* consume //-comment */ }

Now, take a look at the rule for int.

"int"           { count(); return(INT); }

Here's the lex function to process int and other tokens:

void count(void)
{
    int i;

    for (i = 0; yytext[i] != '\0'; i++)
        if (yytext[i] == '\n')
            column = 0;
        else if (yytext[i] == '\t')
            column += 8 - (column % 8);
        else
            column++;

    ECHO;
}

You see that this helper keeps the matched text (it ends with the ECHO statement) and, crucially, the int rule itself returns a token (return(INT)) for the parser to consume.

Now, here's the lex function to process comments:

void comment(void)
{
    char c, prev = 0;

    while ((c = input()) != 0)      /* (EOF maps to 0) */
    {
        if (c == '/' && prev == '*')
            return;
        prev = c;
    }
    error("unterminated comment");
}

There's no ECHO here, and nothing is returned either. So the comment is simply consumed, and the parser never sees it.

This is a representative example from C, but Python does essentially the same thing.
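You can check the same thing from Python itself: the parse tree produced by ast.parse contains no trace of a comment (the exact dump format varies a little between Python versions):

>>> import ast
>>> ast.dump(ast.parse("x = 1  # this comment is consumed by the tokenizer"))
"Module(body=[Assign(targets=[Name(id='x', ctx=Store())], value=Num(n=1))])"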


Section 2

What happens to docstrings?

Note: This section of my answer is meant to be a complement to @MartijnPieters' answer. It is not meant to replicate any of the information he has furnished in his post. Now, with that said,...

I originally guessed that comments and docstrings are removed in a first pass by the CPython compiler[...]

Docstrings (string literals that are not assigned to any variable name, i.e. anything wrapped in '...', "...", '''...''', or """...""") are indeed processed. They are parsed as plain string literals (the STRING+ token), as Martijn Pieters mentions in his answer. The current docs only mention in passing that docstrings are assigned to the function/class/module's __doc__ attribute; how that is done is not really described in depth anywhere.

What actually happens is that they are tokenised and parsed as string literals, and the resulting parse tree contains them. The bytecode is then generated from that parse tree, with the docstrings ending up in their rightful place in the __doc__ attribute (they are not explicitly part of the bytecode, as illustrated below). I won't go into further detail, since the answer above describes all of this very nicely.

Of course, it is possible to discard them entirely. If you run python -OO (-OO applies the -O optimisations and additionally strips docstrings), the resulting bytecode is written out without the docstrings (historically to .pyo files; since Python 3.5 to .pyc files tagged opt-2).

An illustration can be seen below:

Create a file test.py with the following code:

def foo():
    """ docstring """
    pass

Now, we'll compile this code with the normal flags set.

>>> code = compile(open('test.py').read(), '', 'single')
>>> import dis
>>> dis.dis(code)
  1           0 LOAD_CONST               0 (<code object foo at 0x102b20ed0, file "", line 1>)
              2 LOAD_CONST               1 ('foo')
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (foo)
              8 LOAD_CONST               2 (None)
             10 RETURN_VALUE

As you can see, there is no mention of our docstring in the bytecode. However, it is still there. To get the docstring, you can do...

>>> code.co_consts[0].co_consts
(' docstring ', None)

So, as you can see, the docstring does remain, just not as a part of the main bytecode. Now, let's recompile this code, but with the optimisation level set to 2 (equivalent of the -OO switch):

>>> code = compile(open('test.py').read(), '', 'single', optimize=2)
>>> dis.dis(code)
  1           0 LOAD_CONST               0 (<code object foo at 0x102a95810, file "", line 1>)
              2 LOAD_CONST               1 ('foo')
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (foo)
              8 LOAD_CONST               2 (None)
             10 RETURN_VALUE

No difference, but...

>>> code.co_consts[0].co_consts
(None,)

The docstring is now gone.

The -O and -OO flags only remove things (basic optimisation of the bytecode is done by default): -O strips assert statements and if __debug__: suites from the generated bytecode, while -OO additionally discards docstrings. Compilation time decreases slightly as a result, and execution speed stays the same unless your code contains a large number of assert statements and if __debug__: suites.
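
For example (a quick sketch using compile()'s optimize parameter, as above), an assert is compiled into real bytecode by default but vanishes at optimisation level 1 or higher:

>>> import dis
>>> with_assert = compile('assert x\n', '', 'exec', optimize=0)
>>> stripped = compile('assert x\n', '', 'exec', optimize=1)
>>> len(with_assert.co_code) > len(stripped.co_code)
True
>>> dis.dis(stripped)
  1           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE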

Also, do remember that the docstrings are preserved only if they are the first thing in the function/class/module definition. All additional strings are simply dropped during compilation. If you change test.py to the following:

def foo():
    """ docstring """

    """test"""
    pass

and then repeat the same process with optimize=0, this is what is stored in the co_consts variable after compilation:

>>> code.co_consts[0].co_consts
(' docstring ', None)

Meaning, """test""" has been ignored. It may interest you to know that this removal happens during compilation regardless of the optimisation level.


Section 3

Additional reading

(You may find these references as interesting as I did.)

  1. What does Python optimization (-O or PYTHONOPTIMIZE) do?

  2. What do the python file extensions, .pyc .pyd .pyo stand for?

  3. Are Python docstrings and comments stored in memory when a module is loaded?

  4. Working with compile()

  5. The dis module

  6. peephole.c (courtesy Martijn) - The source code for all compiler optimisations. This is particularly fascinating, if you can understand it!

cs95