Due to the recently discovered Unicode trojan-source attacks (also described in PEP 672), I looked a bit deeper into the Unicode/character-encoding behaviour of Python scripts. In Python 2, a non-ASCII source encoding had to be enabled with a special encoding comment as defined in PEP 263; beginning with Python 3, UTF-8 became the default encoding for Python source files (PEP 3120).
To detect (all) possible trojan source attacks I wanted to create example python scripts in UTF-16 and UTF-32, as test files for a custom linter.
Now PEP 263 defines that the first line (or the second, if the first is a shebang such as #!/usr/bin/python) must not contain any non-ASCII characters. Instead it expects a special declaration line # -*- coding: <encoding name> -*- first.
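To illustrate the PEP 263 mechanism, here is a small sketch (using hypothetical temp files) that runs the same latin-1 encoded source with and without the coding declaration. The byte 0xE9 is 'é' in latin-1 but an invalid sequence in UTF-8, so only the declared variant runs:

```python
import os
import subprocess
import sys
import tempfile


def run(source: bytes) -> subprocess.CompletedProcess:
    """Write the raw bytes to a temp .py file and execute it."""
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        return subprocess.run([sys.executable, path], capture_output=True)
    finally:
        os.unlink(path)


# \xe9 is e-acute in latin-1 but an invalid byte sequence in UTF-8.
with_cookie = b"# -*- coding: latin-1 -*-\ns = 'caf\xe9'\nprint('ok')\n"
without_cookie = b"s = 'caf\xe9'\nprint('ok')\n"

print(run(with_cookie).returncode)     # 0: the declaration makes latin-1 legal
print(run(without_cookie).returncode)  # non-zero: the default UTF-8 decode fails
```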
This conflicts with the Unicode Standard (see the Unicode FAQ and this SO question), which says that UTF-16 and UTF-32 text should begin with a BOM.
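For reference, these are the BOM byte sequences involved; the stdlib codecs module exposes them as constants:

```python
import codecs

# The BOM is U+FEFF serialized in the respective encoding.
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF32_LE)  # b'\xff\xfe\x00\x00'
print(codecs.BOM_UTF32_BE)  # b'\x00\x00\xfe\xff'

# The generic 'utf-16' codec writes the native-endian BOM automatically,
# while the endian-specific codecs do not.
print('x'.encode('utf-16-le'))  # b'x\x00' -- no BOM
```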
Now if I create a UTF-16 (or UTF-32) file with BOM marker, Python will complain:
SyntaxError: Non-UTF-8 code starting with '\xff' in file test_utf_16.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
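The error is easy to reproduce with a throwaway temp file (the file name here is hypothetical); the generic 'utf-16' codec prepends a BOM, which the tokenizer rejects as non-UTF-8 input:

```python
import os
import subprocess
import sys
import tempfile

# The generic 'utf-16' codec prepends a BOM (0xFF 0xFE on little-endian machines).
source = 'print("hello")\n'.encode('utf-16')

with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
    f.write(source)
    path = f.name
try:
    result = subprocess.run([sys.executable, path], capture_output=True)
finally:
    os.unlink(path)

print(result.returncode)                        # non-zero
print(result.stderr.decode(errors='replace'))   # the Non-UTF-8 SyntaxError
```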
On the other hand, if I omit the BOM, neither vim nor PyCharm nor GitLab will display the code correctly (unless I change the encoding manually):
To make the situation even more absurd: if I encode a file as UTF-16 or UTF-32 without a BOM, Python (3.10) does not complain, but it also does not execute any print calls (even if the printed text is pure ASCII).
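This silent case can also be reproduced with a temp file. My assumption is that the NUL bytes of BOM-less UTF-16-LE stop the tokenizer early on 3.10; newer Python versions may instead report an error about null bytes, but either way the print output never appears:

```python
import os
import subprocess
import sys
import tempfile

# Endian-specific codec: no BOM, every ASCII char is followed by a NUL byte.
source = 'print("hello")\n'.encode('utf-16-le')

with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
    f.write(source)
    path = f.name
try:
    result = subprocess.run([sys.executable, path], capture_output=True)
finally:
    os.unlink(path)

print(result.stdout)  # "hello" never appears
```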
So long story short:
How to create valid and working Python 3 script files in UTF-16 or UTF-32, that can be edited with vim or PyCharm (out of the box)?
Update: I also figured out that tokenize.detect_encoding, which implements PEP 263, cannot determine the correct encoding even if the declaration string is present in the file, because in UTF-16/UTF-32 the first line no longer matches the ASCII-based regex.
Therefore it seems very unlikely that a valid Python script encoded in UTF-16 or UTF-32 can be generated at all. Or is it possible?
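A quick sketch of why detect_encoding fails: the PEP 263 cookie regex only matches an ASCII-compatible first line, which UTF-16/UTF-32 never produce. Without a BOM the interspersed NUL bytes hide the cookie (so the UTF-8 default wins); with a BOM the first line is not even valid UTF-8 and the function raises:

```python
import io
from tokenize import detect_encoding

declared = "# -*- coding: utf-16 -*-\nprint('x')\n"

# BOM-less UTF-16-LE: NUL bytes break up the word 'coding', so the
# cookie regex never matches and the default encoding is returned.
enc, _lines = detect_encoding(io.BytesIO(declared.encode('utf-16-le')).readline)
print(enc)  # 'utf-8' -- the declaration was not recognized

# With a BOM, the first line cannot be decoded as UTF-8 at all.
try:
    detect_encoding(io.BytesIO(declared.encode('utf-16')).readline)
except SyntaxError as e:
    print('SyntaxError:', e)
```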
Appendix
Python code to create test files
# create_examples.py
from pathlib import Path

base_path = Path(__file__).parent.absolute()


def write_encoding(enc: str, strip_endian: bool = False):
    encoding = enc.lower()
    if encoding.endswith('be') or encoding.endswith('le'):
        # As defined in Unicode, if LE or BE is given explicitly, no BOM is written
        bom_used = False
        if strip_endian:
            enc = enc[:-3]
    else:
        bom_used = True
    suffix = enc
    if bom_used:
        suffix += "_bom"
    name = f"test_{encoding}_{suffix}.py"
    content = f"# vim: set fileencoding={enc} :\n"
    content += f"print(\"{name}\")\n"
    content += 's = "x" * 100  # "x" is assigned'
    content += "\n"
    path = base_path / name
    path.write_bytes(content.encode(encoding))


write_encoding('utf_8')
write_encoding('utf_16')
write_encoding('utf_16_be', False)
write_encoding('utf_16_be', True)
write_encoding('utf_16_le', False)
write_encoding('utf_16_le', True)
write_encoding('utf_32')
write_encoding('utf_32_be', False)
write_encoding('utf_32_be', True)
write_encoding('utf_32_le', False)
write_encoding('utf_32_le', True)
Execute all test files with
for i in test_*.py; do python "$i"; done