UnicodeEncodeError when using the compile function

Question

Using Python 3.2 on Windows 7, I am getting the following in IDLE:

>>> compile('pass', r'c:\temp\工具\module1.py', 'exec')
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

Can anybody explain why the compile function tries to convert the unicode filename using mbcs? I know that sys.getfilesystemencoding() returns 'mbcs' on Windows, but I thought that this encoding is not used when unicode file names are provided.

For example:

f = open(r'c:\temp\工具\module1.py')

works.

For a more complete test, save the following in a UTF-8 encoded file and run it using the standard python.exe version 3.2:

# -*- coding: utf8 -*-
fname = r'c:\temp\工具\module1.py'
# I do have a file named fname, but you can comment out the following two lines
f = open(fname)
print('ok')
cmp = compile('pass', fname, 'exec')
print(cmp)

Output:

ok
Traceback (most recent call last):
  File "module8.py", line 6, in <module>
    cmp = compile('pass', fname, 'exec')
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

tried locally in XP and get a proper code object back. Is this being run from the CLI or is this run via a file? — monkut, Jan 10 '12 at 05:56
I'm going to guess that it's not the call signature that's the problem, but the content of the file that is causing the unicode error. check to make sure that "module1.py" is correctly encoded, with the encoding signature assigned. — monkut, Jan 10 '12 at 06:24
@monkut: In Python 3.x, you don't have to worry about encoding - if there are UTF-8 characters in the file, then they'll be rendered as UTF-8 characters. — Makoto, Jan 10 '12 at 06:26
hmmmm... still seems like an encoding issue with "module1.py". Perhaps the sig is set to "mbcs" overriding the default? — monkut, Jan 10 '12 at 06:38
See the edited version of the question. The compile function should not care whether the filename exists or its encoding. The source code is passed as a unicode string as the first argument. — PyScripter, Jan 10 '12 at 09:52
It is when unicode filenames are provided that the file system encoding *is* used... But the error then implies that you are using a filename which can't exist on your system. And that seems strange. — Lennart Regebro, Jan 10 '12 at 10:44
@Lennart Regebro: 1. [I doubt that `compile()` reads the file or touches it in any way](http://ideone.com/6y8xk) 2. In any case Python3 uses wide (Unicode) Windows API for dealing with files and an encoding should not be used. 3. the error seems like an artifact of creation of a code object and has nothing to do with the content of the file or filesystem — jfs, Jan 10 '12 at 11:06
The compile function converts the filename argument to bytes using the filesystem encoding: http://hg.python.org/cpython/file/4f8c24830a5c/Python/bltinmodule.c#l576 . I suspect it shouldn't be doing this. — Thomas K, Jan 10 '12 at 13:25
@J.F.Sebastian: Even if the encoding of the filename is done when creating the code object, it still means he has a filename that mbcs can't handle on a system supposedly using mbcs. And that still seems very strange. But, as Thomas K says, since Python uses the unicode API to talk to the file system, the conversion to mbcs, wherever it is, probably shouldn't be done. At least not on Windows. — Lennart Regebro, Jan 10 '12 at 13:39
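
A minimal sketch of the point made in the comments above, namely that `compile()` does not open or read the file named by its second argument and only records the name on the resulting code object. The path below is hypothetical and is assumed not to exist on the machine running the snippet; it is ASCII-only on purpose, since a non-ASCII name may still raise the error from the question on an affected Python 3.2 Windows build.

```python
import os

# Hypothetical path, assumed not to exist on the machine running this.
fname = "no_such_dir_for_compile_demo/module1.py"
assert not os.path.exists(fname)

# The source is passed as the first argument; the filename is only a label.
code = compile("x = 1 + 1", fname, "exec")

# The label is stored on the code object and later shown in tracebacks.
print(code.co_filename)

# The code object runs normally even though no such file exists.
namespace = {}
exec(code, namespace)
print(namespace["x"])
```

Because the file is never touched, the content or encoding of module1.py cannot be the cause of the error; only the filename string itself is involved.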

score 5 · Answer 1 · answered Jan 10 '12 at 13:52

From Python issue 10114, it seems that the logic is that all filenames used by Python should be valid for the platform where they are used. The filename is encoded using the filesystem encoding so that it can be used in the C internals of Python.

I agree that it probably shouldn't throw an error on Windows, because any Unicode filename is valid. You may wish to file a bug report with Python for this. But be aware that the necessary changes might not be trivial, because any C code using the filename has to have something to do if it can't be encoded.
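
As a rough illustration of why the encoding step can fail: on Windows, `mbcs` refers to the active ANSI code page, and a code page such as cp1252 has no mapping for CJK characters. The sketch below uses `cp1252` as an assumed stand-in for `mbcs` (which is only available on Windows and depends on the system locale), and the helper name `can_encode` is hypothetical. It does not reproduce what `compile()` does internally; it only shows the strict encode that the error message points to.

```python
import sys


def can_encode(filename, encoding):
    """Return True if filename can be encoded strictly with the given codec."""
    try:
        filename.encode(encoding)  # error handling is 'strict' by default
        return True
    except UnicodeEncodeError:
        return False


fname = r"c:\temp\工具\module1.py"

# What this interpreter reports as its filesystem encoding
# (varies by platform and Python version).
print(sys.getfilesystemencoding())

# cp1252 is an assumed stand-in for a Western European 'mbcs' code page.
print(can_encode(fname, "cp1252"))

# UTF-8 can represent the same name without trouble.
print(can_encode(fname, "utf-8"))

# An ASCII-only name encodes fine with the narrow code page.
print(can_encode(r"c:\temp\tools\module1.py", "cp1252"))
```

A check like this could be used to detect a filename that an affected interpreter would reject before passing it to `compile()`.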

answered Jan 10 '12 at 13:52

Thomas K

A related question is why on the latest version of Windows the file system encoding should still be mbcs. – PyScripter Jan 10 '12 at 15:49
@PyScripter: Should it be something else? – Thomas K Jan 10 '12 at 19:33
It should be UTF-16 at least in the modern versions of Windows – PyScripter Jan 11 '12 at 06:20
@PyScripter: I'm not sure about that. Windows has unicode APIs which expect UTF-16 arguments, but the filesystem encoding is for use with bytes-oriented APIs, and I'm pretty sure those expect 8-bit strings, not UTF-16. – Thomas K Jan 11 '12 at 12:37
Python uses the unicode (UTF-16) API for communication with the file system, but it uses mbcs for checking the validity of file names. This leads to the problem of failing to compile perfectly valid file names as demonstrated here. – PyScripter Jan 12 '12 at 00:53
@Pyscripter: Yes, I follow that. But `compile` also works with the filename as a C string (an 8-bit string), so it needs to encode it somehow for that. – Thomas K Jan 12 '12 at 00:57
What it should do on Windows is decode the bytes (ANSI) string using mbcs and pass the unicode object to the filesystem, not the reverse. But I also understand that in most other operating systems (e.g. Linux) the reverse should happen, since they do not have a unicode API for the file system. And I guess this is the difficulty in resolving this issue. – PyScripter Jan 12 '12 at 14:22
@PyScripter: Well, what it should do on Windows is to avoid using mbcs at all, store the unicode filename given, and pass it directly to the unicode API. But I don't know how big a change that would need. – Thomas K Jan 12 '12 at 18:39

Framester · Answer 2 · 2012-11-13T14:44:29.803

Here is a solution that worked for me, from Issue 427: UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range (128):

If you look at the PyScripter help file under the topic "Encoded Python Source Files" (last paragraph), it tells you how to configure Python to support other encodings by modifying the site.py file. This file is in the lib subdirectory of the Python installation directory. Find the function setencoding and make sure that support for locale-aware default string encodings is switched on (see below).

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:  # <<<--- set this to 1
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

Xiangyun Guo · Answer 3 · 2017-07-04T03:38:20.277

0

I think you could try changing the "\" in the file path to "/", that is, replace

compile('pass', r'c:\temp\工具\module1.py', 'exec')

with

compile('pass', r'c:/temp/工具/module1.py', 'exec')

I ran into a problem just like yours and solved it this way. I hope it works for you too.

edited Jul 04 '17 at 03:38

answered Jul 04 '17 at 02:24

Xiangyun Guo
