
Because of the recently discovered Unicode trojan source attacks (also described in PEP 672), I looked a bit deeper into the Unicode/character encoding behaviour of Python scripts. In Python 2, a non-ASCII source encoding had to be declared with a specific encoding line as defined in PEP 263; beginning with Python 3, UTF-8 is the default encoding for Python files (PEP 3120).

To detect (all) possible trojan source attacks, I wanted to create example Python scripts in UTF-16 and UTF-32 as test files for a custom linter.

Now PEP 263 defines that the first line (or the second, if the first is #!/usr/bin/python) should not contain any non-ASCII characters. Instead, it expects a special line # -*- coding: <encoding name> -*- first.

This conflicts with the Unicode Standard (see FAQ and SO question), which defines that UTF-16 and UTF-32 files shall contain a BOM.

Now if I create a UTF-16 (or UTF-32) file with a BOM, Python will complain:

SyntaxError: Non-UTF-8 code starting with '\xff' in file test_utf_16.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
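
The '\xff' in that message is consistent with a little-endian UTF-16 or UTF-32 BOM, whose first byte is 0xFF. A minimal sketch for checking which BOM (if any) a file starts with, using the BOM constants from the standard codecs module. The helper name sniff_bom and the returned labels are made up for illustration:

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so checking UTF-16 first would misreport UTF-32.
# Caveat: a UTF-16-LE file whose first character is U+0000 is
# indistinguishable from UTF-32-LE by this check alone.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A linter could call sniff_bom(path.read_bytes()) on each file and flag anything other than None or "utf-8-sig".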

On the other hand, if I omit the BOM, neither vim nor PyCharm nor GitLab will display the code correctly (unless I change the encoding manually):

Screenshot of PyCharm and VIM displaying NUL characters
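
The NUL characters in the screenshot are what you would expect when UTF-16 bytes are read as a single-byte encoding: every ASCII character is stored as two bytes, one of which is 0x00. A small sketch illustrating this (the sample source line is arbitrary):

```python
source = 'print("hi")\n'

# An explicit byte order ("-le") means no BOM is written.
encoded = source.encode("utf-16-le")

# Each ASCII character contributes exactly one NUL byte.
nul_count = encoded.count(b"\x00")

print(encoded[:8])
print(nul_count, len(source))
```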

To make the situation even more absurd: if I encode a file in UTF-16 or UTF-32, Python (3.10) does not complain, but it also does not execute any of the print commands (even if print contains only ASCII).

So long story short:

How can I create valid and working Python 3 script files in UTF-16 or UTF-32 that can be edited with vim or PyCharm (out of the box)?

Update: I also figured out that tokenize.detect_encoding, which implements PEP 263, isn't able to determine the correct encoding even if the coding string is within the file, because in UTF-16/UTF-32 the first line no longer matches the ASCII regex.
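
A minimal way to reproduce this observation, assuming a two-line sample script and a hypothetical helper named probe. It feeds UTF-16 bytes to tokenize.detect_encoding once without and once with a BOM:

```python
import codecs
import io
import tokenize

# Sample script with a PEP 263 coding line (content chosen for illustration).
SOURCE = "# -*- coding: utf-16 -*-\nprint('hi')\n"


def probe(data: bytes) -> str:
    """Return the encoding detect_encoding reports, or 'SyntaxError' if it raises."""
    try:
        encoding, _lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    except SyntaxError:
        return "SyntaxError"
    return encoding


without_bom = SOURCE.encode("utf-16-le")        # explicit byte order, no BOM
with_bom = codecs.BOM_UTF16_LE + without_bom    # same bytes, BOM prepended

print("without BOM:", probe(without_bom))
print("with BOM:   ", probe(with_bom))
```

In neither case should the result be a UTF-16 codec: without a BOM the interleaved NUL bytes keep the coding line from matching, so the default is reported, and with a BOM the first line is not decodable as UTF-8.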

Therefore it seems very unlikely that it is possible to generate a valid Python script encoded in UTF-16 or UTF-32. Or is it?

Appendix

Python code to create test files

# create_examples.py
from pathlib import Path

base_path = Path(__file__).parent.absolute()


def write_encoding(enc: str, strip_endian: bool = False):
    encoding = enc.lower()
    if encoding.endswith('be') or encoding.endswith('le'):
        # Per the Unicode standard, no BOM is written when the byte order (le/be) is given explicitly
        bom_used = False
        if strip_endian:
            enc = enc[:-3]
    else:
        bom_used = True
    suffix = enc
    if bom_used:
        suffix += "_bom"
    name = f"test_{encoding}_{suffix}.py"
    content = f"# vim: set fileencoding={enc} :\n"
    content += f"print(\"{name}\")\n"
    content += 's = "x‏" * 100  #    "‏x" is assigned'
    content += "\n"
    path = base_path / name
    path.write_bytes(content.encode(encoding))

write_encoding('utf_8')
write_encoding('utf_16')
write_encoding('utf_16_be', False)
write_encoding('utf_16_be', True)
write_encoding('utf_16_le', False)
write_encoding('utf_16_le', True)
write_encoding('utf_32')
write_encoding('utf_32_be', False)
write_encoding('utf_32_be', True)
write_encoding('utf_32_le', False)
write_encoding('utf_32_le', True)

Execute all test files with: for i in test_*; do python "$i"; done

Kound
  • See `:help encoding-names` in Vim. – romainl Nov 09 '21 at 12:59
  • As I tried to explain in the question: I would expect PyCharm (and vim) to follow PEP 263 to determine the codec correctly without me needing to set it manually. Of course, once I change it manually, PyCharm shows it correctly. But once I save it, PyCharm writes a BOM that you can't remove. It says "This file has mandatory BOM", so it understands Unicode the way I do, but Python doesn't. – Kound Nov 09 '21 at 13:07
  • I just tried this and fell flat on my face. – bad_coder Nov 09 '21 at 14:18
  • https://stackoverflow.com/a/26136961/7976758 **Python source code cannot be in UTF-16/32 encodings.** https://stackoverflow.com/a/28204682/7976758 Found in https://stackoverflow.com/search?q=%5Bpython%5D+source+utf-16 The list of recommended source code encodings is: ASCII, Latin-1, UTF-8. – phd Nov 09 '21 at 14:26
  • @phd would you be so nice and create an answer from this? Otherwise I would answer myself. Thanks for figuring that out, sometimes it is only about finding the correct search terms. – Kound Nov 09 '21 at 14:30
