13

Why does the following occur:

>>> u'\u0308'.encode('mbcs')   # COMBINING DIAERESIS (umlaut)
'\xa8'
>>> u'\u041A'.encode('mbcs')   # CYRILLIC CAPITAL LETTER KA
'?'

I have a Python application accepting filenames from the operating system. It works for some international users, but not others.

For example, this unicode filename: u'\u041a\u0433\u044b\u044b\u0448\u0444\u0442'

will not encode with the Windows 'mbcs' encoding (the one used by the filesystem, as returned by sys.getfilesystemencoding()). I get '???????', indicating that the encoder fails on those characters. But this makes no sense, since the filename came from the user to begin with.

Update: Here's the background. I have a file on my system with a Cyrillic name. I want to call subprocess.Popen() with that file as an argument. Popen won't handle unicode. Normally I can get away with encoding the argument with the codec given by sys.getfilesystemencoding(). In this case it won't work.

Norman
  • Please add information about the Popen call: what's the executable? is it written by you? – tzot Dec 29 '09 at 01:48
  • The executable is written by my team; we've resolved this issue by encoding the command-line in utf-8, and having the called executable decode it. (as suggested by John Machin below. Thanks!) – Norman Dec 30 '09 at 23:02
  • In order to get a valid answer please post a sample section of the problematic code (popen). In any case calling encode('mbcs') will not solve your problem. Anything that will use the current codepage is not going to be a valid solution. – sorin Jan 06 '10 at 14:55

5 Answers

8

In Py3K - at least from Python 3.2 - subprocess.Popen and sys.argv work consistently with (default unicode) strings on Windows; evidently CreateProcessW and GetCommandLineW are used.

In Python 2 - up to v2.7.2 at least - subprocess.Popen is buggy with Unicode arguments. It sticks to CreateProcessA (while the os.* functions are consistent with Unicode). And shlex.split creates additional nonsense.

Pywin32's win32process.CreateProcess also doesn't auto-switch to the W version, nor is there a win32process.CreateProcessW. Same with GetCommandLine. Thus ctypes.windll.kernel32.CreateProcessW... needs to be used. The subprocess module perhaps should be fixed regarding this issue.

Passing UTF-8 in argv[1:] between private apps remains clumsy on a Unicode OS. Such tricks may be legitimate on OSes with 8-bit ("Latin1"-style) strings, like Linux.

UPDATE: vaab has created a patched version of Popen for Python 2.7 which fixes the issue.
See https://gist.github.com/vaab/2ad7051fc193167f15f85ef573e54eb9
Blog post with explanations: http://vaab.blog.kal.fr/2017/03/16/fixing-windows-python-2-7-unicode-issue-with-subprocesss-popen/

Aaron Digulla
kxr
  • 1
    There is an open bug for this: http://bugs.python.org/issue1759845 They aren't going to fix it. The only solution is to move to Python 3. – Aaron Digulla Nov 27 '13 at 12:57
  • 1
    @AaronDigulla: I see at least two workarounds based on this answer on Python 2.7 for `subprocess`' args with characters outside `mbcs` encoding: 1. call appropriate for your case `os.*` function that might already support Unicode 2. call `CreateProcessW` directly with Unicode arguments – jfs Mar 22 '14 at 18:49
  • @J.F.Sebastian: I was referring to "The subprocess module perhaps should be fixed". If you want to use the official API, then Python 3 is the only solution. That said, do you have a working example how to replace `subprocess.Popen` with `CreateProcessW`? – Aaron Digulla Mar 24 '14 at 12:45
  • @AaronDigulla: no. I haven't tried to call `CreateProcessW` manually. – jfs May 06 '15 at 16:36
  • 1
    @AaronDigulla: I have released a fix that exploits @J.F.Sebastian's suggestions by calling ``CreateProcessW(..)`` via ``ctypes`` for Python 2.7 on Windows. [check this](http://vaab.blog.kal.fr/2017/03/16/fixing-windows-python-2-7-unicode-issue-with-subprocesss-popen/) – vaab Mar 16 '17 at 10:48
  • 1
    @vaab How about posting an answer which we can upvote? – Aaron Digulla May 30 '17 at 13:40
  • @AaronDigulla: done. I posted an answer that can be upvoted, with hopefully a nice summary of the situation, although I must say @kxr has a very nice one already. – vaab May 31 '17 at 03:39
5

DISCLAIMER: I'm the author of the fix mentioned in the following.

To support unicode command lines on Windows with Python 2.7, you can use this patch to subprocess.Popen(..).

The situation

Python 2 support for unicode command lines on Windows is very poor.

The following are severely bugged:

  • issuing the unicode command line to the system from the caller side (via subprocess.Popen(..)),

  • and reading the current command line's unicode arguments from the callee side (via sys.argv).

This is acknowledged and won't be fixed in Python 2. Both are fixed in Python 3.

Technical Reasons

In Python 2, the Windows implementation of subprocess.Popen(..) uses the non-unicode-ready Windows system call CreateProcess(..) (see python code, and MSDN doc of CreateProcess), and sys.argv does not use GetCommandLineW(..).

In Python 3, the Windows implementation of subprocess.Popen(..) makes use of the correct Windows system call CreateProcessW(..) starting from 3.0 (see code in 3.0), and sys.argv uses GetCommandLineW(..) starting from 3.3 (see code in 3.3).

How is it fixed

The given patch leverages the ctypes module to call the Windows system function CreateProcessW(..) directly. It provides a new, fixed Popen object by overriding the private method Popen._execute_child(..) and the private function _subprocess.CreateProcess(..) to set up and use CreateProcessW(..) from the Windows system lib in a way that mimics as closely as possible how it is done in Python 3.6.

How to use it

How to use the given patch is demonstrated in this blog post explanation. It additionally shows how to read the current process's sys.argv with another fix.

vaab
  • 1
    +1: I haven't tested it but it looks like a workable solution (`from win_subprocess import Popen` as a drop-in replacement, to enable `Popen(u"unicode command", ...)`; see [`win_subprocess.py`](https://gist.github.com/vaab/2ad7051fc193167f15f85ef573e54eb9#file-win_subprocess-py)) – jfs Sep 07 '17 at 19:53
3

Docs for sys.getfilesystemencoding() say that for Windows NT and later, file names are natively Unicode. If you have a valid unicode file name, why would you bother encoding it using mbcs?

Docs for codecs module say that mbcs encodes using "ANSI code page" (which will differ depending on user's locale) so if the locale doesn't use Cyrillic characters, splat.
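To make the code page dependence concrete, here is a minimal sketch. It uses two fixed code pages as stand-ins for 'mbcs': cp1252 (Western European) and cp1251 (Cyrillic). That substitution is an assumption for illustration only; on a real Windows machine 'mbcs' resolves to whichever ANSI code page the user's locale selects, and its treatment of unmappable characters may differ in detail from Python's built-in codecs. The helper name `survives` is hypothetical.

```python
# Stand-ins for 'mbcs' under two different locales (assumption: the real
# 'mbcs' codec maps to whatever ANSI code page the Windows locale selects).
WESTERN = 'cp1252'   # a typical Western European ANSI code page
CYRILLIC = 'cp1251'  # a typical Cyrillic ANSI code page

name = u'\u041a\u0433\u044b\u044b\u0448\u0444\u0442'  # filename from the question


def survives(text, codepage):
    """Return True if text round-trips through the code page unchanged."""
    encoded = text.encode(codepage, 'replace')  # unmappable chars become '?'
    return encoded.decode(codepage) == text


print(survives(name, CYRILLIC))  # the Cyrillic code page can represent the name
print(survives(name, WESTERN))   # the Western code page cannot
```

The same filename is therefore harmless for a user with a Cyrillic locale and lossy for everyone else, which matches the "works for some international users, but not others" symptom.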

Edit: So your process is calling subprocess.Popen(). If your invoked process is under your control, the two processes should be able to agree to use UTF-8 as the Unicode Transformation Format. Otherwise, you may need to ask on the pywin32 mailing list. In any case, edit your question to state the degree of control you have over the invoked process.
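A minimal sketch of such an agreement, assuming both programs are yours. The helper names `encode_arg` and `decode_arg` are hypothetical, and the sketch assumes the byte string reaches the callee's argv unaltered (the asker reports in the comments that this approach worked for them):

```python
def encode_arg(text):
    """Caller side: unicode string -> UTF-8 byte string for the command line."""
    return text.encode('utf-8')


def decode_arg(raw):
    """Callee side: UTF-8 byte string from the command line -> unicode string."""
    return raw.decode('utf-8')


name = u'\u041a\u0433\u044b\u044b\u0448\u0444\u0442'  # filename from the question
wire = encode_arg(name)          # what the caller would place in the Popen args
assert decode_arg(wire) == name  # what the callee recovers from its argv
```

Unlike 'mbcs', UTF-8 can represent every character regardless of the user's locale, so the round trip does not depend on the ANSI code page.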

John Machin
  • 1
    I have a file on my system with the name in Cyrillic. I want to call subprocess.Popen() with that file as an argument. Popen won't handle unicode. Normally I can get away with encoding the argument with the codec given by sys.getfilesystemencoding(). In this case it won't work. – Norman Dec 15 '09 at 23:30
  • @Norman: please edit your question to include the info about subprocess.Popen() – John Machin Dec 16 '09 at 04:20
2

If you need to pass the name of an existing file, then you might have a better chance of success by passing the 8.3 version of the Unicode filename.

You need to have the pywin32 package installed, then you can do:

>>> import win32api
>>> win32api.GetShortPathName(u"C:\\Program Files")
'C:\\PROGRA~1'

I believe these short filenames use only ASCII characters, and therefore you should be able to use them as arguments to a command line.

Should you also need to specify filenames to be created, you can create them with zero size in advance from Python using Unicode filenames, and pass the short name of the file as an argument.

UPDATE: user bogdan says correctly that 8.3 filename generation can be disabled (I had it disabled, too, when I had Windows XP on my laptop), so you can't rely on them. So, as another more far-fetched approach when working on NTFS volumes, one can hard link the Unicode filenames to plain ASCII ones; pass the ASCII filenames to an external command and delete them afterwards.
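The hard link idea could be sketched as follows. This is only an illustration under stated assumptions: it relies on `os.link`, which is available on Windows from Python 3.2 onward (on Python 2 the link would have to be created some other way), and the helper name `ascii_alias` is hypothetical. Only the file name component is made ASCII; the containing directory is left as it is.

```python
import contextlib
import os
import uuid


@contextlib.contextmanager
def ascii_alias(path):
    """Temporarily hard link `path` under an ASCII-only name in the same directory."""
    directory, original = os.path.split(os.path.abspath(path))
    extension = os.path.splitext(original)[1]
    # Keep the extension only if it is itself pure ASCII.
    if any(ord(ch) > 127 for ch in extension):
        extension = ''
    alias = os.path.join(directory, 'alias_' + uuid.uuid4().hex + extension)
    os.link(path, alias)  # hard links only work within a single volume
    try:
        yield alias
    finally:
        os.remove(alias)  # removes the alias; the original file is untouched
```

Inside a `with ascii_alias(unicode_path) as alias:` block, `alias` can be passed to the external command, and the link is deleted when the block exits.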

tzot
  • 3
    You should never try to use 8.3 filenames; please remember that these are optional and can be missing. It's common practice to disable NTFS short filename generation in order to speed up the filesystem. – bogdan Jan 04 '10 at 15:14
  • If I may object to your first subsentence: one can *try* using 8.3 filenames, but should *not rely* on them. Ergo my "you might have a better chance". – tzot Jan 04 '10 at 23:46
0

With Python 3, just don't encode the string. Windows filenames are natively Unicode, and all strings in Python 3 are Unicode, and Popen uses the Unicode version of the CreateProcess Windows API function.
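As a small Python 3 illustration, the following sketch passes the filename from the question to a child process as an ordinary str. The child here is just another Python interpreter standing in for the real executable (an assumption for the demo), and the helper name `echo_through_child` is hypothetical. The child reports its argument as an ASCII-only repr so the demo does not depend on the console encoding.

```python
import ast
import subprocess
import sys

# Child program (stand-in for the real executable): writes an ASCII-only
# repr of its first argument to stdout.
CHILD = 'import sys; sys.stdout.write(ascii(sys.argv[1]))'


def echo_through_child(text):
    """Pass `text` to a child Python process as a plain str and read it back."""
    out = subprocess.check_output([sys.executable, '-c', CHILD, text])
    return ast.literal_eval(out.decode('ascii'))


name = '\u041a\u0433\u044b\u044b\u0448\u0444\u0442'  # filename from the question
print(echo_through_child(name) == name)
```

No call to encode() appears on the caller side; the argument stays a str all the way to Popen.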

With Python 2.7, the easiest solution is to use the third-party module https://pypi.org/project/subprocessww/. There is no "built-in" solution to get full Unicode support (independent of system locale), and the maintainers of Python 2.7 consider this a feature request rather than a bugfix, so this is not going to change.

For a detailed technical explanation of why things are as they are, please see the other answers.

Florian Winter