
I'm trying to run the command echo hej värld (Swedish for "hello world") from Python code.

So far I have tested:

    # -*- coding: utf-8 -*-
    import subprocess
    print subprocess.check_output("Echo hej värld", shell = True)

And

    # -*- coding: utf-8 -*-
    import os
    os.system("Echo hej värld")

Both versions print hej värld with the ä replaced by garbled characters.

If I simply type the command into the CMD prompt, it prints the proper version, with ä.

Jack Pettersson

1 Answer


I ran some tests on a Windows 7 system. The problem is not in the execution of the command but only in the display of UTF-8 characters.

First, it works almost correctly using Python 3.4: it can display ä without problems. So I assume you are using a 2.x version.

On a 2.x version, it is almost impossible to display UTF-8 strings properly. If you manage to do it correctly, the driver will complain because the number of characters differs from the number of bytes.
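To make the character/byte mismatch concrete, here is a minimal sketch (the variable names are only illustrative). It compares the two counts for "värld" and shows how the same UTF-8 bytes read when interpreted with an 8-bit code page, which is the usual source of garbled output:

```python
text = u"v\u00e4rld"                 # "värld" as a Unicode string
utf8_bytes = text.encode("utf-8")

char_count = len(text)               # number of characters
byte_count = len(utf8_bytes)         # number of UTF-8 bytes (ä takes two)

# The same UTF-8 bytes, read as if they were in an 8-bit code page
seen_as_cp850 = utf8_bytes.decode("cp850")
seen_as_cp1252 = utf8_bytes.decode("cp1252")
```

Here `char_count` is 5 while `byte_count` is 6, and neither decoded string contains ä any more.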

You can find more references here: Windows cmd encoding change causes Python crash. In particular, the referenced Python bug was still open on 2014-10-02.

So what to do?

The only correct solution on Windows is to use an 8-bit character set. Latin1 (Windows cp1252) should display Swedish characters provided you use the Consolas font. CP850 is normally the OEM raster character set (in Western Europe) and also works correctly.
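As a small illustration of why the 8-bit code pages are easier to handle, the sketch below (illustrative names only) encodes ä in the two code pages mentioned above and in UTF-8. Each 8-bit code page uses a single byte, but not the same one, so the encoding has to match what the receiving side expects:

```python
a_umlaut = u"\u00e4"  # ä

# Byte representation of ä in each encoding
encoded = {name: a_umlaut.encode(name) for name in ("cp1252", "cp850", "utf-8")}

# Number of bytes used by each encoding
lengths = {name: len(data) for name, data in encoded.items()}
```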

EDIT : concrete how-to

  • for Python 2.7 :

    import os

    # first define a unicode string in a portable way
    utxt = u"hej v\u00e4rld"
    # encode it to the ANSI code page (cp1252 here), whatever the console code page is
    txt = utxt.encode('cp1252')

    os.system('echo ' + txt)
    
  • for Python 3.x :

    import os

    # first define a unicode string in a portable way
    utxt = u"hej v\u00e4rld"

    os.system('echo ' + utxt)
    

Of course, if you have the # -*- coding: utf-8 -*- line, you can safely write värld instead of v\u00e4rld.

EDIT (4):

eryksun's comment is the proper explanation of what happens. Python 2.7 uses CreateProcessA, meaning it expects the command in the Windows ANSI code page and not in the OEM code page. So on a system whose ANSI code page is Windows-1252, you must encode the command as cp1252.
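A rough sketch of that conversion follows. The helper name `encode_for_ansi` is hypothetical (it is not part of the standard library), and it assumes, as eryksun's comment states, that `locale.getpreferredencoding()` reports the ANSI code page on Windows:

```python
import locale

def encode_for_ansi(command, ansi_encoding=None):
    """Encode a Unicode command line to bytes in the ANSI code page.

    Hypothetical helper for illustration. When ansi_encoding is None,
    fall back to locale.getpreferredencoding(), which on Windows
    reports the ANSI code page (for example cp1252).
    """
    if ansi_encoding is None:
        ansi_encoding = locale.getpreferredencoding()
    return command.encode(ansi_encoding)

# With cp1252 as the ANSI code page, ä becomes the single byte 0xE4.
command_bytes = encode_for_ansi(u"echo hej v\u00e4rld", "cp1252")
```

Under Python 2.7 the resulting byte string is what would be handed to `os.system`. Under Python 3 the `str` is passed directly and no manual encoding is needed.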

Latin1 (or iso-8859-1), Latin9 (iso-8859-15) and Windows-1252 are almost the same character set, but the € sign is one of the differences between them! And if you want it under Windows you must use the cp1252 variant.
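The following sketch (illustrative names only) compares the three character sets using Python's built-in codecs. They agree on ä but not on the euro sign, which Latin1 does not contain at all:

```python
a_umlaut = u"\u00e4"  # ä
euro = u"\u20ac"      # euro sign
charsets = ("latin-1", "iso-8859-15", "cp1252")

# ä has the same byte value in all three character sets
a_umlaut_bytes = {name: a_umlaut.encode(name) for name in charsets}

# The euro sign differs: None marks a character set that lacks it
euro_bytes = {}
for name in charsets:
    try:
        euro_bytes[name] = euro.encode(name)
    except UnicodeEncodeError:
        euro_bytes[name] = None
```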

Serge Ballesta
  • Tried both 1252 and 850, using consolas no luck :( And yes, i'm using 2.7. – Jack Pettersson Nov 18 '14 at 13:35
  • I can't manage to make it work :/ Did the method you describe in your edit work for you? – Jack Pettersson Nov 18 '14 at 14:05
  • I tried it under windows 7 french version and it worked (see last edit) – Serge Ballesta Nov 18 '14 at 14:05
  • Last edit worked! You are a god! Thanks a lot :) Character encoding in CMD is awful. Edit: And just FYI, code pages seem to be independent of Python's encoding. You can use any chcp; latin1 was what did it. – Jack Pettersson Nov 18 '14 at 14:08
  • Python 2 `system` and `subprocess.Popen` call `CreateProcessA`, which decodes the command line to Unicode as an ANSI string, i.e. with the encoding `ansi = locale.getpreferredencoding()`. Since the `echo` command writes unicode to the console by calling `WriteConsoleW`, the output codepage shouldn't be an issue, other than for selecting a font with the proper glyphs. – Eryk Sun Nov 18 '14 at 15:44
  • @eryksun : Thank you! I now understand what happens ... I've edited my post with your comment – Serge Ballesta Nov 18 '14 at 16:07
  • You'd have to change your system locale and reboot to practically test this. For example, by changing the system locale to Greek in the control panel's region and language settings. In codepage 1253 `'\xe4'` maps to `u'δ'`, so in Python 2 I'd expect `os.system('echo hej v\xe4rld')` to echo "hej vδrld" to the console. That's what I get for the following: `ctypes.windll.kernel32.SetConsoleOutputCP(1253);` `print 'hej v\xe4rld'`. However, decoding according to the console codepage isn't the same decoding path as what `CreateProcessA` uses. The latter is hard-coded at boot time. – Eryk Sun Nov 18 '14 at 16:24
  • @eryksun : It should break before. At line `txt = utxt.encode('cp1253')`, I get `UnicodeEncodeError: 'charmap' codec can't encode character u'\xe4' in position 1: character maps to <undefined>` – Serge Ballesta Nov 18 '14 at 16:59
  • I used the Python 2 byte string `'\xe4'`. The Unicode string `u'\xe4'` is "ä", which isn't mapped in the Greek codepage, 1253. – Eryk Sun Nov 18 '14 at 17:19