88

Since # starts a comment in Python, how does Python treat:

# -*- coding: utf-8 -*-

differently?

smci
ngShravil.py
  • 1
    This actually happens often with software. Tools do some elementary parsing of comments and look for specific commands. Another example I have in mind is Hypermesh, but I am sure there are many more. – Ma0 Jan 16 '17 at 16:12
  • 1
    You can think of it as a preprocessor that runs before the parser, peeks at the file and decides how it should be decoded. Then the parser itself starts and skips the line because it's a comment. Some unixy text editors do the same thing to know how the editor should open the file. – tdelaney Jan 16 '17 at 16:16
  • 5
    @Ev.Kounis Perhaps the most prominent example: https://en.wikipedia.org/wiki/Shebang_(Unix) – deceze Jan 16 '17 at 16:17
  • 7
    It's worth noting that the `-*-` parts are completely optional, as far as Python is concerned, but including them seems to be customary. [The docs](https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations) say it "is recognized also by GNU Emacs", which suggests that that's where it comes from (an example of what @tdelaney was saying about text editors), but I've seen it (and used it myself) in code that was never touched by Emacs. – Tim Pederick Jan 16 '17 at 16:43
  • @TimPederick: I've used the VIM variant of the comment often enough. Are you *certain* there wasn't an emacs user on the team somewhere that appreciated having their editor auto-configured when editing the file? – Martijn Pieters Jan 16 '17 at 17:15
  • @MartijnPieters: I feel like I'm admitting to cargo-cult programming, but I've used it myself on my own, personal, *solo* projects, because somewhere I seem to have picked up that it was the customary form. Actually... now that I think about it, it's because when I first started on Python, my (non-Emacs) editor would prompt me to add it! What was it? IDLE? Kate? I think it was Kate... – Tim Pederick Jan 16 '17 at 17:24
  • @TimPederick: no such luck, [kate has their own format](https://docs.kde.org/stable5/en/kate/katepart/config-variables.html). – Martijn Pieters Jan 16 '17 at 17:34
  • 3
    @MartijnPieters: I've got it! **If** you're using IDLE on Python 2, and **if** your file contains non-ASCII characters (as mine often did if I added a copyright line), then it will prompt you to add an encoding declaration, using the Emacs `-*-` style. So that's where I picked it up from. – Tim Pederick Jan 16 '17 at 19:17

2 Answers

77

Yes, it is also a comment. And the contents of that comment carry special meaning if located at the top of the file, in the first two lines.

From the Encoding declarations documentation:

If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration; the first group of this expression names the encoding of the source code file. The encoding declaration must appear on a line of its own. If it is the second line, the first line must also be a comment-only line.

Note that, as far as these comments are concerned, it doesn't matter which codec is needed to read the rest of the file. Python would normally ignore everything after the # token, and in all accepted source code codecs the #, the encoding declaration and the line separator characters are encoded exactly the same, because those codecs are all supersets of ASCII. So all the parser has to do is read one line, scan it for the special text in the comment, read another line if needed and scan that too, then configure itself to read the data according to the given codec.

Given that the comment is required to be on either the first or second line of the file (and if it is on the second line, the first line must be a comment too), this is entirely safe, as the configured codec can only make a difference to non-comment lines anyway.
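A minimal sketch of that scan, assuming the regular expression quoted from the documentation above. The names `declared_encoding` and `stdlib_encoding` are hypothetical helpers, and CPython's real tokenizer differs in details (for instance, it also handles a UTF-8 byte order mark). The standard library's `tokenize.detect_encoding` does the actual detection and is wrapped here for comparison:

```python
import io
import re
import tokenize

# Regular expression quoted from the encoding declarations documentation,
# compiled for bytes because the file has not been decoded yet.
CODING_RE = re.compile(rb"coding[=:]\s*([-\w.]+)")


def declared_encoding(source: bytes):
    """Return the codec named in the first two lines, or None (hypothetical helper)."""
    for line in source.splitlines()[:2]:
        text = line.strip()
        if text.startswith(b"#"):
            match = CODING_RE.search(text)
            if match:
                return match.group(1).decode("ascii")
        elif text:
            # A first line containing code means line 2 is not consulted.
            return None
    return None


def stdlib_encoding(source: bytes) -> str:
    """Ask the standard library which encoding it detects for these bytes."""
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


print(declared_encoding(b"# -*- coding: latin-1 -*-\nprint('x')\n"))
print(declared_encoding(b"#!/usr/bin/env python\n# coding=utf-8\n"))
print(declared_encoding(b"import os\n# coding: latin-1\n"))
print(stdlib_encoding(b"# coding: cp1252\nx = 1\n"))
```

Only ASCII bytes are inspected here, which is why the scan can run before the codec for the rest of the file is known.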

Martijn Pieters
  • 20
    So the real question becomes: why do we use `# -*- coding: X -*-` instead of `# coding: X`? – Jorge Leitao Jan 16 '17 at 17:04
  • 15
    @J.C.Leitão: you don't have to. *Anything that matches the regular expression* would work. But if you are using Emacs as your editor, then that comment also informs that editor what codec to use. – Martijn Pieters Jan 16 '17 at 17:12
  • @MartijnPieters ... are there any other like this, which adds special meaning to the comment line? – ngShravil.py Jan 16 '17 at 18:10
  • 3
    @ShravilPotdar: There's loads. There is the [shebang line](https://en.wikipedia.org/wiki/Shebang_(Unix)) that Unix systems use, and the [Windows `py` launcher](https://docs.python.org/3/using/windows.html#shebang-lines) will look at the same info. As mentioned, many editors can be configured using text in comments (not just what codec to use, but many other aspects as well, see the [emacs](https://www.gnu.org/software/emacs/manual/html_node/emacs/Specifying-File-Variables.html) and [vim](http://vimdoc.sourceforge.net/htmldoc/options.html#modeline) docs). There are probably more. – Martijn Pieters Jan 16 '17 at 18:25
  • @ShravilPotdar: many linters use config in comments too, see the [pylint manual](https://pylint.readthedocs.io/en/latest/faq.html#message-control), or [flake8](http://flake8.pycqa.org/en/latest/user/ignoring-errors.html#in-line-ignoring-errors). – Martijn Pieters Jan 16 '17 at 18:27
  • @ShravilPotdar: Since comments have no meaning to the *program*, they are easily hijacked by other systems that have to work with the code, basically. – Martijn Pieters Jan 16 '17 at 18:28
  • _in all accepted source code codecs the #, encoding declaration and line separator characters are encoded exactly the same_: Does Python not support UTF-16 source code? – R.M. Jan 16 '17 at 21:33
  • 3
    @R.M.: no, multi-byte codecs are not supported, for this very reason. From [PEP 263](https://www.python.org/dev/peps/pep-0263/): *Any encoding which allows processing the first two lines in the way indicated above is allowed as source code encoding, this includes ASCII compatible encodings as well as certain multi-byte encodings such as Shift_JIS. It does not include encodings which use two or more bytes for all characters like e.g. UTF-16. The reason for this is to keep the encoding detection algorithm in the tokenizer simple.* – Martijn Pieters Jan 16 '17 at 21:38
  • @R.M. Does *any* programming language support multi-byte codecs? I don't mean that to sound aggressive... I've just never heard of such a thing. Since UTF-8 can handle all language characters, and since source code needn't worry about file size (source code is small compared to data), I would not think it necessary to support anything "more complete" than UTF-8. – Mike Williamson Jun 15 '20 at 19:53
  • @MikeWilliamson PowerShell can handle scripts written in UTF-16BE and UTF-16LE – JM0 Aug 10 '23 at 15:42
21

See encoding declarations in the Python Reference Manual:

If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration; the first group of this expression names the encoding of the source code file.

(Emphasis mine)

So yes, it is a comment, but a special one. It is special in that the parser will try to act on it rather than ignore it, as it does with comments that are not on the first or second line. Take, for example, an unregistered encoding declaration in a sample file decl.py:

# -*- coding: unknown-encoding -*-
print("foo")

If you try and run this, Python will try and process it, fail and complain:

python decl.py 
  File "decl.py", line 1
SyntaxError: encoding problem: unknown-encoding
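The same behaviour can be reproduced without a file, as a rough sketch: when the built-in `compile()` is given bytes, it honours the encoding declaration too. The helper name `try_compile` is hypothetical, and the exact error message can differ between Python versions, so only the exception type is checked:

```python
# Source bytes carrying the same unregistered encoding declaration.
source = b"# -*- coding: unknown-encoding -*-\nprint('foo')\n"


def try_compile(data: bytes) -> str:
    """Return 'ok' or the name of the exception raised (hypothetical helper)."""
    try:
        # compile() only parses the source; print('foo') is never executed.
        compile(data, "decl.py", "exec")
    except SyntaxError as exc:
        return type(exc).__name__
    return "ok"


print(try_compile(source))
print(try_compile(b"# -*- coding: utf-8 -*-\nprint('foo')\n"))
```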
Dimitris Fasarakis Hilliard
  • 3
    But if you were to register `unknown-encoding` as an encoding, say, with a `.pth` file, then that codec is actually loaded and used. This provides a very nice and interesting opportunity for pre-parse code processing. – Martijn Pieters Jan 16 '17 at 16:20
  • Indeed @MartijnPieters I mainly added that as a code example that Python processes the declaration, not to make any other claims for it. – Dimitris Fasarakis Hilliard Jan 16 '17 at 16:22
  • 1
    https://github.com/dropbox/pyxl would be an example of what @MartijnPieters is referring to. – Łukasz Rogalski Jan 16 '17 at 16:39