I thought I knew everything about encodings and Python, but today I came across a weird problem: although the console is set to code page 850 - and Python reports it correctly - parameters I put on the command line seem to be encoded in code page 1252. If I try to decode them with sys.stdin.encoding, I get the wrong result. If I assume 'cp1252', ignoring what sys.stdout.encoding reports, it works.
Am I missing something, or is this a bug in Python? Or in Windows? Note: I am running Python 2.6.6 on Windows 7 EN, with the locale set to French (Switzerland).
In the test program below, I check that literals are correctly interpreted and can be printed - this works. But all values I pass on the command line seem to be encoded wrongly:
#!/usr/bin/python
# -*- encoding: utf-8 -*-
import sys
literal_mb = 'utf-8 literal: üèéÃÂç€ÈÚ'
literal_u = u'unicode literal: üèéÃÂç€ÈÚ'
print "Testing literals"
print literal_mb.decode('utf-8').encode(sys.stdout.encoding,'replace')
print literal_u.encode(sys.stdout.encoding,'replace')
print "Testing arguments ( stdin/out encodings:",sys.stdin.encoding,"/",sys.stdout.encoding,")"
for i in range(1,len(sys.argv)):
    arg = sys.argv[i]
    print "arg",i,":",arg
    for ch in arg:
        print " ",ch,"->",ord(ch),
        if ord(ch)>=128 and sys.stdin.encoding == 'cp850':
            print "<-",ch.decode('cp1252').encode(sys.stdout.encoding,'replace'),"[assuming input was actually cp1252 ]"
        else:
            print ""
In a newly created console, when running
C:\dev>test-encoding.py abcé€
I get the following output
Testing literals
utf-8 literal: üèéÃÂç?ÈÚ
unicode literal: üèéÃÂç?ÈÚ
Testing arguments ( stdin/out encodings: cp850 / cp850 )
arg 1 : abcÚÇ
a -> 97
b -> 98
c -> 99
Ú -> 233 <- é [assuming input was actually cp1252 ]
Ç -> 128 <- ? [assuming input was actually cp1252 ]
while I would expect the 4th character to have an ordinal value of 130 instead of 233 (see the code pages 850 and 1252).
Notes: the value of 128 for the euro symbol is a mystery - since cp850 does not have it. Otherwise, the '?' are expected - cp850 cannot print the characters and I have used 'replace' in the conversions.
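To double-check the ordinals I am quoting, here is a small sketch that compares how 'é' and '€' map to bytes in both code pages, and what cp1252 bytes look like when a cp850 console displays them. The helper name byte_values is just something I made up for this check; it only relies on the standard cp850 and cp1252 codecs.

```python
# Sketch: compare byte values of two characters in cp850 and cp1252.
# Unicode escapes are used so the encoding of this file does not matter.

def byte_values(text, codepage):
    """Return the byte values of text in codepage, or None if it cannot be encoded."""
    try:
        return list(bytearray(text.encode(codepage)))
    except UnicodeEncodeError:
        return None

E_ACUTE = u'\xe9'   # e with acute accent
EURO = u'\u20ac'    # euro sign

for name, ch in (("e-acute", E_ACUTE), ("euro", EURO)):
    for cp in ("cp850", "cp1252"):
        print("%s in %s: %r" % (name, cp, byte_values(ch, cp)))

# Bytes 233 and 128 (cp1252 for e-acute and euro) as a cp850 console shows them
misread = bytearray([233, 128]).decode("cp850")
```

This gives 130 for 'é' in cp850 and 233 in cp1252, and 128 for '€' in cp1252 (cp850 cannot encode it at all). The two cp1252 bytes read back as cp850 come out as 'Ú' and 'Ç', which is what "arg 1" shows above, so the 128 would be consistent with the argument arriving as cp1252 bytes.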
If I change the code page of the console to 1252 by issuing chcp 1252
and run the same command, I (correctly) obtain
Testing literals
utf-8 literal: üèéÃÂç€ÈÚ
unicode literal: üèéÃÂç€ÈÚ
Testing arguments ( stdin/out encodings: cp1252 / cp1252 )
arg 1 : abcé€
a -> 97
b -> 98
c -> 99
é -> 233
€ -> 128
Any ideas what I'm missing?
Edit 1: I've just tested by reading sys.stdin. This works as expected: in cp850, typing 'é' results in an ordinal value of 130. So the problem really is limited to the command line. Is the command line treated differently from standard input?
Edit 2: It seems I had the wrong keywords. I found another very close topic on SO: Read Unicode characters from command-line arguments in Python 2.x on Windows. Still, if the command line is not encoded like sys.stdin, and since sys.getdefaultencoding() reports 'ascii', it seems there is no way to know its actual encoding. I find the answer using win32 extensions pretty hacky.
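For what it is worth, here is a minimal sketch of a workaround that avoids the win32 extensions. It rests on an assumption I have not verified beyond my own machine: that the command-line bytes are in the system's ANSI code page and that locale.getpreferredencoding() reports that code page (cp1252 here). The function name decode_args is hypothetical.

```python
import locale

def decode_args(raw_args, encoding=None):
    """Decode byte-string arguments to unicode.

    Assumption: when no encoding is given, the locale's preferred
    encoding matches the encoding of the command-line bytes.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding()
    decoded = []
    for arg in raw_args:
        # Only byte strings need decoding; unicode values pass through.
        if isinstance(arg, bytes):
            arg = arg.decode(encoding, 'replace')
        decoded.append(arg)
    return decoded
```

It would be called as decode_args(sys.argv[1:]). Characters that the ANSI code page cannot represent are presumably already lost before Python sees them, so this is not a full replacement for reading the wide-character command line.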