10

I'm trying to run subprocess.call() with a Unicode filename. Here is a simplified version of the problem:

import subprocess

n = u'c:\\windows\\notepad.exe '
f = u'c:\\temp\\nèw.txt'

subprocess.call(n + f)

which raises the well-known error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8'

Encoding to utf-8 produces the wrong filename, and mbcs passes the filename as new.txt, without the accent.
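Both symptoms can be reproduced without launching anything. This is only a sketch: it assumes the Windows ANSI code page is cp1251 (the value mentioned later in the comments), and it uses the portable cp1251 codec as a stand-in for mbcs, which is Windows-only and may substitute a best-fit character such as a plain e instead of raising.

```python
# -*- coding: utf-8 -*-
name = u'n\xe8w.txt'  # u'nèw.txt'

# UTF-8 bytes that are later read as a single-byte code page become mojibake.
utf8_bytes = name.encode('utf-8')
misread = utf8_bytes.decode('cp1251')

# cp1251 (assumed code page) has no slot for U+00E8, so a strict encode fails.
try:
    name.encode('cp1251')
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

If encodable is False for the target code page, no byte-string encoding of the name can reach the child process intact.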

I can't read any more on this confusing subject; I'm going in circles. I've found answers here to many different problems in the past, so I thought I'd join and ask for help myself.

Thanks

otrov
  • Depending on your operating system, what happens if you use latin-1 or cp1252 as your encoding? – Kathy Van Stone Apr 07 '10 at 20:01
  • Have you specified the encoding of the source file? – Humphrey Bogart Apr 07 '10 at 20:13
  • The source file is UTF-8 encoded: # -*- coding: utf-8 -*-. I use the latin-1 trick from time to time, but I can't in this case: 1. I also need other characters that aren't in latin-1. 2. Unfortunately it doesn't work with subprocess: the same error is raised, even though I encoded both strings with the same latin-1 encoding. Thanks for all the answers – otrov Apr 07 '10 at 20:46
  • Is this Python 2.x or 3.x? If 2.x, maybe you can try it on 3.x. – Craig McQueen Apr 08 '10 at 00:04
  • It's 2.6. I've thought about changing to 3, but not right now – otrov Apr 08 '10 at 00:49
  • It looks like [Python 3 should already support Unicode arguments with `subprocess.call()`](http://bugs.python.org/issue19264) – jfs Mar 22 '14 at 18:26

7 Answers

8

I found a workaround. It's a bit messy, but it works.

subprocess.call passes the text to the terminal in its own encoding, which may or may not be the one the terminal expects. Because you want the code to be portable, you'll need to determine the machine's encoding at runtime.

The following

import subprocess
import sys
notepad = u'C://Notepad.exe'
subprocess.call([notepad.encode(sys.getfilesystemencoding())])

attempts to figure out the current encoding and therefore applies the correct one to subprocess.call
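Since encoding can silently lose characters, it may be worth checking that the encoded name decodes back to the original before passing it on. The helper below is a hypothetical sketch (lossless_encode is not a standard function), and the code pages used when calling it are only examples.

```python
import sys


def lossless_encode(text, encoding=None):
    """Return text encoded as bytes, or None if any character would be lost."""
    encoding = encoding or sys.getfilesystemencoding()
    data = text.encode(encoding, 'replace')
    # A substituted character ('?' or a best-fit letter) will not round-trip.
    if data.decode(encoding, 'replace') != text:
        return None
    return data


western = lossless_encode(u'n\xe8w.txt', 'cp1252')
cyrillic = lossless_encode(u'n\xe8w.txt', 'cp1251')
```

A None result means the name cannot be represented in that code page, so a different approach is needed for that file.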

As a side note, I have also found that if you compose a string with the current directory, using

os.getcwd()

Python (or the OS, I don't know which) will mess up directories with accented characters. To prevent this, I have found the following to work:

os.getcwd().decode(sys.getfilesystemencoding())

This is very similar to the solution above.

Hope it helps.

RedOrav
  • OP says: *"mbcs passes filename as new.txt without accent"*. `mbcs` is `sys.getfilesystemencoding()` on Windows i.e., `.encode(sys.getfilesystemencoding())` doesn't work in this case. – jfs Mar 22 '14 at 18:05
  • @J.F.Sebastian we don't see the same OP ;) The file `nèw.txt` is with accents. – Kpym Dec 17 '15 at 10:25
  • @Kpym: it is a direct quote from the question that means that encoding a Unicode name with accent (`nèw.txt`) using Windows ANSI codepage (`mbcs`) may and does lose the accent on OP's system e.g., `u'nèw.txt'.encode('ascii', 'ignore')` -> `b'new.txt'` (the actual codepage is not ascii) – jfs Dec 17 '15 at 14:41
6

If your file exists, you can use its short filename (also known as the 8.3 name). This name is defined only for existing files, and it should cause no trouble for programs that are not Unicode-aware when it is passed as an argument.

One way to obtain one (needs Pywin32 to be installed):

import win32api
short_path = win32api.GetShortPathName(unicode_path)

Alternatively, you can also use ctypes:

import ctypes
import ctypes.wintypes

ctypes.windll.kernel32.GetShortPathNameW.argtypes = [
    ctypes.wintypes.LPCWSTR, # lpszLongPath
    ctypes.wintypes.LPWSTR, # lpszShortPath
    ctypes.wintypes.DWORD # cchBuffer
]
ctypes.windll.kernel32.GetShortPathNameW.restype = ctypes.wintypes.DWORD

buf = ctypes.create_unicode_buffer(1024) # adjust buffer size, if necessary
ctypes.windll.kernel32.GetShortPathNameW(unicode_path, buf, len(buf))

short_path = buf.value
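A variant of the ctypes approach, as a sketch: it first asks GetShortPathNameW for the required buffer size instead of guessing 1024, and falls back to the original path when the lookup fails or when not running on Windows. The function name short_path_or_original is made up for illustration.

```python
import sys


def short_path_or_original(path):
    """Return the 8.3 form of an existing path on Windows, else path unchanged."""
    if sys.platform != "win32":
        return path
    import ctypes
    from ctypes import wintypes

    get_short = ctypes.windll.kernel32.GetShortPathNameW
    get_short.argtypes = [wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD]
    get_short.restype = wintypes.DWORD

    # With a zero-length buffer the call returns the size it needs,
    # in characters, including the terminating null.
    needed = get_short(path, None, 0)
    if needed == 0:
        return path  # lookup failed, e.g. the file does not exist
    buf = ctypes.create_unicode_buffer(needed)
    if get_short(path, buf, needed) == 0:
        return path
    return buf.value
```

The result could then be passed to subprocess.call in place of unicode_path. Note that short names can be disabled on some volumes, in which case the returned path may still contain the accented characters.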
WGH
1

It appears that, to make this work, the subprocess code would have to be modified to use the wide-character version of CreateProcess (CreateProcessW). There's a PEP discussing the same change made for the file object at http://www.python.org/dev/peps/pep-0277/. Perhaps you could research the Windows C calls and propose a similar change for subprocess.
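For comparison, on Python 3 a str argument can be handed to subprocess directly, with no manual encoding step. The sketch below uses a child Python interpreter instead of notepad so that the received argument can be inspected; echo_argument is a hypothetical helper, and the example assumes Python 3 with a locale that can represent the character.

```python
import subprocess
import sys


def echo_argument(arg):
    """Start a child interpreter and return the repr of the argument it saw."""
    # The child writes an ASCII-only repr, so console encodings do not matter.
    child_code = "import sys; sys.stdout.write(ascii(sys.argv[1]))"
    output = subprocess.check_output([sys.executable, "-c", child_code, arg])
    return output.decode("ascii")


seen = echo_argument(u"n\xe8w.txt")
```

Passing the arguments as a list also avoids building a single command string by hand.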

clahey
  • I don't feel up to the task of researching this problem, though it's funny to see its author (Neil), who just released SciTE 2.10 with support for Unicode (wide char) file name access – otrov Apr 07 '10 at 20:42
0

I don't have an answer for you, but I've done a fair amount of research into this problem. Python converts all output (including system calls) to the same character encoding as the terminal it is running in. Windows terminals use code pages for character mapping; the default code page is 437, but it can be changed with the chcp command. chcp 65001 will theoretically change the code page to utf-8, but as far as I know Python doesn't know what to do with this, so you're out of luck.

0

You can try opening the file as:

subprocess.call((n + f).encode("cp437"))

or whichever code page chcp reports as being in use in a command prompt window. If you try chcp 65001 as starbuck suggested, you'll first have to edit the stdlib encodings\aliases.py file and add cp65001 as an alias for 'utf-8'. It's an open issue in the Python source.
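Rather than editing the stdlib file, it may be possible to register the alias at runtime with codecs.register. This is a sketch, not a tested fix for the console problem; the search function name is made up, and on Python versions that already know cp65001 the registration simply has no effect.

```python
import codecs


def _cp65001_search(name):
    # Return a codec for the one name we care about, None for anything else.
    if name == 'cp65001':
        return codecs.lookup('utf-8')
    return None


codecs.register(_cp65001_search)

encoded = u'n\xe8w.txt'.encode('cp65001')
```

This only teaches Python the name; it does not change what the console or the child process does with the bytes.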

UPDATE: since the program has to run on multiple targets, run a single chcp command first, parse its output, and retrieve the current "Command Prompt" (DOS) code page. Then use that code page to encode the subprocess.call argument.
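The parsing step could look like the sketch below. It assumes the code page number is the last run of digits in the chcp output, which avoids depending on the localized message text; codepage_from_chcp is a hypothetical helper name.

```python
import re


def codepage_from_chcp(output):
    """Turn chcp output such as 'Active code page: 866' into 'cp866'."""
    numbers = re.findall(r'\d+', output)
    if not numbers:
        return None
    return 'cp' + numbers[-1]


example = codepage_from_chcp('Active code page: 866\r\n')
```

On Windows the input would come from something like subprocess.check_output('chcp', shell=True), decoded with errors ignored before parsing.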

tzot
  • I'm on cp1251, but the program is supposed to run on different machines with arbitrary locales – otrov Apr 08 '10 at 00:48
  • cp1251 is the Windows codepage. When running commands with subprocess, you need to use the "DOS"/command prompt codepage. – tzot Apr 08 '10 at 20:45
  • @tzot: it is incorrect unless you mean `mbcs` encoding (you could see its value using `locale.getpreferredencoding()`) and OP already said that `mbcs` on his system doesn't support required characters. `chcp` may return different encoding. – jfs Mar 22 '14 at 18:31
0

As ΤΖΩΤΖΙΟΥ and starbuck mentioned, the problem is with the console code page, which in your case is 866 (the Russian localization of Windows) and not 1251. Just run chcp in a console to check.

The problem is the same as when you want to output Unicode to the Windows console. Unfortunately, you will need at least to register an alias for unicode as 'cp866' in encodings\aliases.py (or do it programmatically on script start), change the code page of the console to 65001 before running notepad, and set it back afterwards.

chcp 65001 & c:\WINDOWS\notepad.exe nèw.txt & chcp 866

By the way, to run the command in a console and see the filename displayed correctly, you will need to change the console font to Lucida Console in the console window properties.

It might be even worse: you may need to change the code page of the current process. To do that, you will need either to run chcp 65001 right before the script starts or to use pywin32 to do it within the script.

newtover
  • Thanks for all the effort, guys, much appreciated :) Unfortunately I can't make it work. The string passed to subprocess(), or more precisely to CreateProcess(), is printed as "chcp 65001 & c:\windows\notepad.exe nèw.txt", which throws the error "system cannot find the file specified". Maybe I'm doing it wrong, but I tried what I understood. I don't have a problem pasting a Unicode filename into the Windows console in my current code page, which can be seen here: http://img402.imageshack.us/img402/9875/sshot1x.png – otrov Apr 08 '10 at 14:16
0

Use os.startfile with the operation "edit". This should work better, as it opens the default application registered for the file's extension.
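A minimal sketch of that call, guarded so the module still imports on platforms where os.startfile does not exist. open_for_editing is a hypothetical wrapper, and whether an "edit" action is registered depends on the file type and the machine.

```python
import os


def open_for_editing(path):
    """Open path with the application registered for 'edit'; Windows only."""
    startfile = getattr(os, "startfile", None)
    if startfile is None:
        return False  # os.startfile is only available on Windows
    startfile(path, "edit")
    return True
```

Calling open_for_editing(f) with the Unicode path from the question would avoid building a command line for notepad altogether.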

j0k