
I am always perplexed by high-ASCII handling in Python 2.x. I am currently facing an issue in which I have a string with high-ASCII characters in it. I have a few questions related to it.

  1. How can a string store high-ASCII characters (not a unicode string, but a normal str in Python 2.x), when I thought str could handle only ASCII characters? Does Python internally convert the high-ASCII to something else?

  2. I have a CLI which I spawn as a subprocess from my Python code. When I pass this string to the CLI, it works fine, but if I encode the string to UTF-8 first, the CLI fails (the string is a password, so it fails saying the password is invalid).

For the second point, I did a bit of research and found the following: in Windows (sucks), the command-line args are encoded in mbcs (sys.getfilesystemencoding()). What I still don't get is that if I read the same string using raw_input, it is encoded in the Windows console encoding (on EN Windows, it was cp437).

I am now confused about a related Windows encoding question. Is sys.stdin.encoding different from the Windows console encoding? If yes, is there a Pythonic way to figure out what my Windows console encoding is? I need this because when I read input using raw_input, it's encoded in the Windows console encoding, and I want to convert it to, say, UTF-8. Note: I have already set my sys.stdin.encoding to utf-8, but it doesn't seem to have any effect on the read input.

Ankit
  • "I thought [strings] can handle only ascii chars". What makes you think that? – Kevin May 21 '15 at 18:32
  • Look [here](http://stackoverflow.com/questions/4987327/how-do-i-check-if-a-string-is-unicode-or-ascii). Also note that high-ascii (the extended ascii table) characters' numeric representation is still within 0 to 255 range so a byte could still contain them. – Zach P May 21 '15 at 18:34
  • @Kevin I thought of strings as a sequence of characters, where each character is 8 bits, thus only the range 0-255. Sorry, not ASCII, but the 0-255 range. Anything above it cannot go in a string? – Ankit May 21 '15 at 19:08
    Ok, I agree that a single character in Python 2.7 can only range from 0 to 255. But that doesn't seem to conflict with the idea that a string can store hi-ascii characters, if we define "hi-ascii" as "the range of characters having an ordinal value of 128 - 255". Or are you using a different definition? – Kevin May 21 '15 at 19:10
  • I agree that a character can hold 0-255 (ASCII + high ASCII). But what I did was store something beyond that, 'æüÿ€éêè', which still came out as type(str). How can Python store these in a str? I thought it would automagically be unicode. – Ankit May 21 '15 at 19:22
  • "Extended ASCII" is also commonly (albeit somewhat incorrectly) referred to as "ANSI". Certain Unicode characters can be encoded as 8bit values in the 128-255 range depending on which ANSI encoding is being used (ISO-8859-1/Latin-1, ISO-8859-2/Latin-2, KOI8-R, etc) to encode them. – Remy Lebeau May 21 '15 at 23:26

2 Answers


To answer your first question: Python 2.x byte strings contain the source-encoded bytes of the characters, i.e. the exact bytes used to store the string in the source file on disk. For example, here is a Python 2.x program whose source is saved in Windows-1252 encoding (Notepad's default on US Windows):

#!python2
#coding:windows-1252
s = 'æüÿ€éêè'
u = u'æüÿ€éêè'
print repr(s)
print repr(u)

Output:

'\xe6\xfc\xff\x80\xe9\xea\xe8'
u'\xe6\xfc\xff\u20ac\xe9\xea\xe8'

The byte string contains the bytes that represent the characters in Windows-1252.

For the Unicode string, Python decodes that same sequence of bytes using the declared source encoding (#coding:windows-1252) into Unicode codepoints. Since Windows-1252 is very close to ISO-8859-1, and ISO-8859-1 maps bytes 0-255 one-to-one onto the first 256 Unicode codepoints, the codepoints are almost the same, except for the Euro character.
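You can check that mapping yourself by decoding the byte string explicitly. A small sketch, using the byte values from the example output above:

```python
raw = b'\xe6\xfc\xff\x80\xe9\xea\xe8'  # the Windows-1252 bytes from above

# Decoding as Windows-1252 yields the intended characters:
print(repr(raw.decode('windows-1252')))

# Latin-1 maps bytes 0-255 straight onto codepoints 0-255, so the result
# differs only at the Euro byte: 0x80 is U+0080 in Latin-1, U+20AC in CP1252.
print(raw.decode('latin-1') == raw.decode('windows-1252'))  # False
```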

But save the source in a different encoding, and you'll get those bytes instead for the byte string:

#!python2
#coding:utf8
s = 'æüÿ€éêè'
u = u'æüÿ€éêè'
print repr(s)
print repr(u)

Output:

'\xc3\xa6\xc3\xbc\xc3\xbf\xe2\x82\xac\xc3\xa9\xc3\xaa\xc3\xa8'
u'\xe6\xfc\xff\u20ac\xe9\xea\xe8'

So Python 2.x just gives you the source-code bytes directly, without decoding them to Unicode codepoints the way a Unicode string would.
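A small sketch to confirm the two outputs represent the same text: the byte sequences from the two versions of the source differ, but decoding each with its own encoding yields the same Unicode string.

```python
# Bytes as stored by the Windows-1252 source vs. the UTF-8 source:
cp1252_bytes = b'\xe6\xfc\xff\x80\xe9\xea\xe8'
utf8_bytes = b'\xc3\xa6\xc3\xbc\xc3\xbf\xe2\x82\xac\xc3\xa9\xc3\xaa\xc3\xa8'

# Different bytes, same text once decoded correctly:
print(cp1252_bytes.decode('windows-1252') == utf8_bytes.decode('utf-8'))  # True
```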

Python 3.x recognizes that this is confusing and simply forbids non-ASCII characters in byte string literals:

#!python3
#coding:utf8
s = b'æüÿ€éêè'
u = 'æüÿ€éêè'
print(repr(s))
print(repr(u))

Output:

  File "C:\test.py", line 3
    s = b'æüÿ\u20acéêè'
       ^
SyntaxError: bytes can only contain ASCII literal characters.

To answer your second question, please edit your question to show an example that demonstrates the problem.

Mark Tolonen

Is the windows sys.stdin.encoding different from Windows console encoding?

Yes. There are two locale-specific codepages:

  • the ANSI code page, aka mbcs, used for strings in the Win32 ...A APIs (the ANSI variants) and hence for C runtime operations like reading the command line;

  • the IO code page, used for stdin/stdout/stderr streams.

These do not have to be the same encoding, and typically they aren't. In my locale (UK), the ANSI code page is 1252 and the IO code page defaults to 850. You can change the console code page using the chcp command, so you can make the two encodings match by running e.g. chcp 1252 before the Python command.

(You also have to be using a TrueType font in the console for chcp to make any difference.)

is there a pythonic way to figure out what my windows console encoding is.

Python reads it at startup using the Win32 API GetConsoleOutputCP and—unless overridden by PYTHONIOENCODING—writes that to the property sys.stdout.encoding. (Similarly GetConsoleCP for stdin though they will generally be the same code page.)

If you need to read this directly, regardless of whether PYTHONIOENCODING is set, you might have to use ctypes to call GetConsoleOutputCP directly.
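A minimal sketch of that ctypes call. This is Windows-only by nature; the guard makes it a no-op elsewhere:

```python
import ctypes
import sys

# Windows-only: query the console code pages straight from Win32,
# regardless of what PYTHONIOENCODING says. Does nothing on other platforms.
if sys.platform == 'win32':
    kernel32 = ctypes.windll.kernel32
    print('console input code page:  cp%d' % kernel32.GetConsoleCP())
    print('console output code page: cp%d' % kernel32.GetConsoleOutputCP())
```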

Note: I have already set my sys.stdin.encoding to utf-8, but it doesn't seem to have any effect on the read input.

(How have you done that? It's a read-only property.)

Although you can certainly treat input and output as UTF-8 at your end, the Windows console won't supply or display content in that encoding. Most other tools you call via the command line will also be treating their input as encoded in the IO code page, so would misinterpret any UTF-8 sent to them.
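So the practical fix on the Python side is to decode what raw_input hands you using the console code page, then re-encode as UTF-8 yourself. A hypothetical sketch (the byte values assume a cp437 console, as in your case):

```python
# 0x82 is 'é' in cp437 (the same byte means something else in cp1252/UTF-8).
console_bytes = b'caf\x82'            # what raw_input() might return on a cp437 console
text = console_bytes.decode('cp437')  # -> u'café'
utf8_bytes = text.encode('utf-8')     # -> b'caf\xc3\xa9'
print(repr(utf8_bytes))
```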

You can affect what code page the console side uses by calling the Win32 SetConsoleCP/SetConsoleOutputCP APIs with ctypes (equivalent of the chcp command and also requires TTF console font). In principle you should be able to set code page 65001 and get something that is nearly UTF-8. Unfortunately long-standing console bugs usually make this approach infeasible.

windows(sucks)

yes.

bobince
  • Thanks a lot for the detailed response. I had a related question: I have PYTHONIOENCODING set to utf-8, which I believe changed sys.stdin.encoding and sys.stdout.encoding to utf-8. But even then, raw_input is returning cp437-encoded input. Shouldn't it give me a UTF-8 encoded string? – Ankit May 22 '15 at 19:06
  • No, PYTHONIOENCODING only changes the Python side: how Python programs might interpret bytes they have been sent, and how Unicode strings are encoded to bytes when printing. It doesn't affect what bytes are passed to Python, or printed from its output, by the console. – bobince May 23 '15 at 10:06