
Before you go telling me to read PEP 0263, keep reading...

I can't find any documentation that details which file encodings are supported for Python 3 source files.

I've found hundreds (thousands?) of questions, answers, posts, emails, etc. about how to declare - at the top of your source file - the encoding of that source file, but none of them answer my question. Bear with me and imagine doing (or actually try) the following:

  1. Open Notepad (I'm using regular old Notepad on Windows 7, but I doubt it matters; I'm sure your superior editor can do something similar.)
  2. Type your favorite line of Python code ( I used print( 'Hello, world!' ) )
  3. Select "File" -> "Save"
  4. Select a folder and file name ( I used "E:\Temp\hello.py" )
  5. Change the "Encoding:" setting from the default "ANSI" to "Unicode"
  6. Press "Save"
  7. Open a command prompt, change to the folder containing your new file, and try to run it

Here's the output I get:

E:\Temp>python --version
Python 3.4.1

E:\Temp>python "hello.py"
  File "hello.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xff' in file hello.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Now, when I open this same file in Notepad++ and look at the "Encoding" menu, it has the option "Encode in UCS-2 Little Endian" selected. Wikipedia tells me that this is basically UTF-16 encoding. Whatever. I don't really care. More research reveals that my editor has inserted a two-byte BOM (Byte Order Mark) with a value of '\xff\xfe' at the front of the file to indicate the file encoding. So at least I know where the '\xff' code that Python is complaining about comes from.
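For illustration, the bytes that Notepad's "Unicode" option produces can be reproduced in Python. This is only a sketch: it assumes the file is a UTF-16LE BOM followed by UTF-16LE text, which is consistent with the '\xff\xfe' prefix described above, and the variable names are made up for the example.

```python
import codecs

source = "print('Hello, world!')\r\n"

# Assumption: Notepad's "Unicode" means BOM (FF FE) + UTF-16 little-endian text.
data = codecs.BOM_UTF16_LE + source.encode("utf-16-le")

# The BOM, then 'p' and 'r', each followed by a zero byte.
print(data[:6])

# 0xFF can never appear in valid UTF-8, so decoding with Python 3's
# default source encoding fails on the very first byte.
try:
    data.decode("utf-8")
    first_bad_offset = None
except UnicodeDecodeError as exc:
    first_bad_offset = exc.start
    print("UTF-8 decoding fails at byte offset", first_bad_offset)
```

The failing offset is the first byte of the file, which is the same byte the SyntaxError message names.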

So I go and read PEP 0263 - and everything else regarding it - on the web, and I try adding a comment like this to the first line of the file

# coding: utf-16

with all sorts of different values for the encoding, and nothing helps. But it can't help, right? Because Python isn't even getting as far as my encoding declaration; it's choking on the first byte of the source file!
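One way to check that suspicion is the standard library's `tokenize.detect_encoding`, which looks for a UTF-8 BOM or a PEP 263 declaration in the first two lines of a source file. A rough sketch follows; `detect` is a hypothetical helper name and the sample sources are invented for the example.

```python
import codecs
import io
import tokenize


def detect(raw: bytes) -> str:
    """Return the detected source encoding, or a marker if detection fails."""
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
        return encoding
    except SyntaxError:
        # Raised when the first lines cannot be read as a declaration at all.
        return "<undetectable>"


plain = b"print('Hello, world!')\n"
declared = b"# coding: cp1252\nprint('Hello, world!')\n"
utf16 = codecs.BOM_UTF16_LE + (
    "# coding: utf-16\nprint('Hello, world!')\n".encode("utf-16-le")
)

for label, raw in (("plain", plain), ("declared", declared), ("utf16", utf16)):
    print(label, "->", detect(raw))
```

The file with no declaration falls back to utf-8, the cp1252 declaration is picked up, and the UTF-16 file fails before its declaration is ever recognised.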

So what I really want to know is...

  1. Why can't the Python 3 interpreter read this file?
  2. If "Unicode" or "UCS-2 Little Endian" or "UTF-16" or whatever isn't supported, what is???

P.S. I even found another question on StackOverflow which seems to be the exact issue I'm having, but it was closed - erroneously in my opinion - as a duplicate. :(

--- EDIT ---

Someone asked for my "compiled options". Here's some output. Maybe it will help?

E:\Temp>python
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:38:22) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sysconfig
>>> print( sysconfig.get_config_vars() )
{'EXT_SUFFIX': '.pyd', 'srcdir': 'C:\\Python34', 'py_version_short': '3.4', 'base': 'C:\\Python34', 'prefix': 'C:\\Python34', 'projectbase': 'C:\\Python34', 'INCLUDEPY': 'C:\\Python34\\Include', 'platbase': 'C:\\Python34', 'py_version_nodot': '34', 'exec_prefix': 'C:\\Python34', 'EXE': '.exe', 'installed_base': 'C:\\Python34', 'SO': '.pyd', 'installed_platbase': 'C:\\Python34', 'VERSION': '34', 'BINLIBDEST': 'C:\\Python34\\Lib', 'LIBDEST': 'C:\\Python34\\Lib', 'userbase': 'C:\\Users\\alonghi\\AppData\\Roaming\\Python', 'py_version': '3.4.1', 'abiflags': '', 'BINDIR': 'C:\\Python34'}
>>>
aldo
  • Can you post your entire hello.py file, from top to bottom, including the "shebang" `#!/bin/env python` or whatever. Also, your compiled options may help: `import sysconfig; print(sysconfig.get_config_vars())` – jedwards Oct 01 '14 at 00:13
  • @jedwards The file contains a single line of code, as stated. – aldo Oct 01 '14 at 00:25
  • @aldo, thanks for the "clarification", but it doesn't help much. That being said, maybe consult [this](https://docs.python.org/2/library/codecs.html#standard-encodings). I have no idea whether it's the list you're interested in, but it seems plausible. Good luck with your question ... – jedwards Oct 01 '14 at 00:30
  • "But it can't help, right? Because Python isn't even getting as far as my encoding declaration; It's choking on the first byte of the source file!" Yes, because UTF-16 encoding uses bytes that can't be understood using the default encoding (ASCII in Python 2; UTF-8 in Python 3). "Why can't the Python 3 interpreter read this file?" Because it has to be able to read the encoding declaration before it can switch to that encoding. – Karl Knechtel Mar 30 '23 at 11:14
  • "If "Unicode" or "UCS-2 Little Endian" or "UTF-16" or whatever isn't supported, what is???" Ones in which the coding declaration would match the **byte** regex described in PEP 263, as described by the text of PEP 263. This falls out automatically from the fact that the PEP was authored wayyyyyy back in 2001, when `str` meant a sequence of bytes that was only pretending to be a string. – Karl Knechtel Mar 30 '23 at 11:16
  • I agree that the PEP isn't explicit about this, and I can't find official documentation for it even in 2023. However, this information belongs at the canonical - and as it happens, I wrote an answer there that covers this sub-topic a little while ago. – Karl Knechtel Mar 30 '23 at 11:19

1 Answer


A source encoding must be:

  1. An encoding supported by the version of Python in question. (This varies by version and platform; for example, you only get mbcs on Windows.)

  2. Loosely ASCII-compatible, enough that the # coding: declaration can be read using ASCII, which is the initial source encoding before any declaration is read. See PEP0263 ‘Concepts’ item 1.
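The second requirement can be illustrated at the byte level. The sketch below uses a bytes pattern modelled on the declaration regex given in PEP 263; `declared_encoding` is a hypothetical helper, not something Python exposes, and the real interpreter's detection logic involves more than this one match.

```python
import re

# Byte-level pattern modelled on the declaration regex in PEP 263.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(first_line: bytes):
    """Return the declared encoding name, or None if no declaration is visible."""
    match = CODING_RE.match(first_line)
    return match.group(1).decode("ascii") if match else None


# Encode the same declaration line in each encoding it names, then see
# whether the declaration is still recognisable in the raw bytes.
results = {}
for name in ("utf-8", "cp1252", "utf-16-le"):
    raw = "# coding: {}\n".format(name).encode(name)
    results[name] = declared_encoding(raw)
    print(name, "->", results[name])
```

In UTF-16 every ASCII character is followed by a zero byte, so the word "coding" never appears as a contiguous run of bytes and the declaration is invisible.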

The encoding that Windows misleadingly calls “Unicode”, UTF-16LE, is not ASCII-compatible (and generally is a barrel of problems you should try to avoid using). Python would need special encoding-specific support to detect UTF-16 source files and this feature has been declined for now.

The # coding: you should use is almost invariably UTF-8.
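If a file has already been saved as "Unicode", one option is to re-encode it. A minimal sketch, assuming the file starts with a UTF-16 BOM as Notepad's output does; `reencode_to_utf8` is a hypothetical helper, and the temporary file only stands in for a real script.

```python
import codecs
import tempfile
from pathlib import Path


def reencode_to_utf8(path, source_encoding="utf-16"):
    """Rewrite a text file in place as UTF-8 (without a BOM)."""
    # The "utf-16" codec consumes the BOM and uses it to pick the byte order.
    # newline="" leaves the existing line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "hello.py"
    # Stand-in for a file saved from Notepad with Encoding: "Unicode".
    script.write_bytes(
        codecs.BOM_UTF16_LE + "print('Hello, world!')\r\n".encode("utf-16-le")
    )
    reencode_to_utf8(script)
    result = script.read_bytes()

print(result)
```

Choosing "UTF-8" in the editor's Save dialog achieves the same thing without any code.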

bobince
  • So the answer *was* there in PEP0263 ('Concepts' item 1): "It does not include encodings which use two or more bytes for all characters like e.g. UTF-16." Thanks for that. This requirement is not spelled out very clearly anywhere that I have found, a complaint repeated in the bug/issue/feature-request you pointed out ("Cannot write source code in UTF16"). Thanks for that reference, too. Much appreciated! – aldo Oct 01 '14 at 16:28
  • Python3 code is unicode. When reading bytes from an external source, the interpreter assumes UTF-8 encoding unless the first line after an optional #! line says otherwise. Similarly, Idle writes with utf-8 encoding unless directed otherwise. So an explicit UTF-8 should not be needed. – Terry Jan Reedy Oct 02 '14 at 06:24