
I noticed that, without a source code encoding declaration, the Python 2 interpreter assumes the source code is encoded in ASCII for scripts and standard input:

$ python test.py  # where test.py holds the line: print u'é'
  File "test.py", line 1
SyntaxError: Non-ASCII character '\xc3' in file test.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

$ echo "print u'é'" | python
  File "/dev/fd/63", line 1
SyntaxError: Non-ASCII character '\xc3' in file /dev/fd/63 on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

and assumes it is encoded in ISO-8859-1 (Latin-1) with the -m module and -c command flags:

$ python -m test  # where test.py holds the line: print u'é'
é

$ python -c "print u'é'"
é

Where is it documented?

Contrast this with Python 3, which always assumes the source code is encoded in UTF-8 and thus prints é in all four cases.

Note: I tested this on CPython 2.7.14 on both macOS 10.13 and Ubuntu Linux 17.10, with the console encoding set to UTF-8.

Géry Ogam

1 Answer


The -c and -m switches ultimately(*) run the supplied code with the exec statement or the compile() function, both of which accept Latin-1 encoded source code:

The first expression should evaluate to either a Unicode string, a Latin-1 encoded string, an open file object, a code object, or a tuple.

This is not documented; it's an implementation detail that may or may not be considered a bug.

I don't think it is worth fixing, however, and Latin-1 is a superset of ASCII, so little is lost. The handling of code from -c and -m has been cleaned up in Python 3 and is much more consistent there: code passed in with -c is decoded using the current locale, and modules loaded with the -m switch default to UTF-8, as usual.
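To illustrate why an ASCII default rejects the source line from the question while a Latin-1 default accepts it, here is a minimal sketch that decodes the same bytes with each codec. It only uses the standard codecs and does not call into the interpreter's tokenizer; the helper name try_decode is hypothetical, and the byte string assumes the question's line was saved as UTF-8.

```python
# UTF-8 bytes of the question's source line: print u'<e-acute>'
# (assumption: the file was saved as UTF-8, so e-acute is the two bytes C3 A9)
source_bytes = b"print u'\xc3\xa9'"


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# ASCII only covers bytes 0x00-0x7F, so the byte 0xC3 cannot be decoded.
as_ascii = try_decode(source_bytes, "ascii")

# Latin-1 maps every byte 0x00-0xFF to one character, so decoding never fails;
# the two UTF-8 bytes simply become two separate characters.
as_latin1 = try_decode(source_bytes, "latin-1")

# UTF-8 combines the two bytes into the single character e-acute.
as_utf8 = try_decode(source_bytes, "utf-8")

print(as_ascii is None)
print(len(source_bytes), len(as_latin1), len(as_utf8))
```

Because Latin-1 assigns a character to every possible byte, a Latin-1 default can never produce the "Non-ASCII character" error, whatever the real encoding of the input was.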


(*) If you want to know the exact implementations used, start at the Py_Main() function in Modules/main.c, which handles both -c and -m as:

if (command) {
    sts = PyRun_SimpleStringFlags(command, &cf) != 0;
    free(command);
} else if (module) {
    sts = RunModule(module, 1);
    free(module);
}
Martijn Pieters
  • Another thing: the Python 2 documentation [states](https://docs.python.org/2.7/tutorial/interpreter.html#the-interpreter-and-its-environment): "By default, Python source files are treated as encoded in UTF-8." My first test above shows that ASCII is the default, so this is a documentation error, right? – Géry Ogam Feb 27 '18 at 07:38
  • @Maggyero: oh boy, yes, that's an error in the tutorial. The language reference is the official source, see the [*Lexical Analysis* section](https://docs.python.org/2/reference/lexical_analysis.html): *Python uses the 7-bit ASCII character set for program text.* – Martijn Pieters Feb 27 '18 at 08:52
  • @Maggyero: I found the [incorrect revision](https://github.com/python/cpython/commit/40ba60f6bf2f7192f86da395c71348d0fa24da09), that error was introduced recently. I've filed [a Python bug report](https://bugs.python.org/issue32963) to have that corrected. – Martijn Pieters Feb 27 '18 at 08:59