Python str vs unicode on Windows, Python 2.7, why does 'á' become '\xa0'

Question

Background

I'm using a Windows machine. I know Python 2.* is not supported anymore, but I'm still learning Python 2.7.16. I also have Python 3.7.1. I know that in Python 3.* "unicode was renamed to str".

I use Git Bash as my main shell.

I read this question. I feel like I understand the difference between Unicode (code points) and encodings (different encoding systems; bytes).

Question

When I evaluate 'á', I expect to get '\xc3\xa1' as shown in this answer
When I evaluate len('á'), I expect to get 2, as shown in this answer

But I don't get the expected results when I run C:\Python27\python.exe... from Git Bash:

Python 2.7.16 (v2.7.16:413a49145e, Mar  4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32

>>> 'á'
'\xa0'
#'\xc3\xa1' expected

>>> len('á') 
1
#2 expected

# one more for reference:
>>> 'à'
'\x85'
#'\xc3\xa0' expected

Can you help me understand why I get the output shown above?

Specifically why does 'á' become '\xa0'?

What I tried

I can use unicode object to get the results I expect:

>>> u'á'.encode('utf-8')
'\xc3\xa1'
>>> len(u'á'.encode('utf-8'))
2

I can open IDLE and I get different results -- not expected results, but at least I understand these results.

Python 2.7.16 (v2.7.16:413a49145e, Mar  4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xe1'
>>> len('á')
1
>>> 'à'
'\xe0'

The IDLE results are unexpected, but I still understand them; Martijn Pieters explains why 'á' becomes '\xe1' in the Latin-1 encoding.

So why does IDLE give different results from running my Python 2.7.16 executable directly in Git Bash? In other words, if IDLE is using Latin-1 to encode my input, what encoding is used by my Python 2.7.16 executable in Git Bash, such that 'á' becomes '\xa0'?

What I'm wondering

Is my default encoding the problem? I'm too scared to change the default encoding.

>>> import sys; sys.getdefaultencoding()
'ascii'

I feel like it's my terminal's encoding that's the problem? (I use git bash) Should I try to change the PYTHONIOENCODING environment variable?

I try to check the git bash locale, the result is:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

Also, I'm using interactive Python; should I try a file instead, using this?

# -*- coding: utf-8 -*- sets the source file's encoding, not the output encoding.

I know upgrading to Python 3 is a solution, but I'm still curious about why my Python 2.7.16 behaves differently.

so, in Python 2, that is interpreted as raw bytes. So it's using whatever encoding your shell is using. If you try using a file with explicitly setting the encoding, you should get what you expect (note, that actually isn't necessary because utf-8 is the default file encoding) — juanpa.arrivillaga, Mar 03 '23 at 20:07
But yeah, I'm not sure what encoding scheme is interpreting that as byte 160 — juanpa.arrivillaga, Mar 03 '23 at 20:10
@juanpa.arrivillaga you said: "whatever encoding your shell is using" That's what I mean when I say: "I feel like it's my terminal's encoding that's the problem?". I guess I'm curious **how to get or set my terminal's encoding, do you know?** you said "try using a file with explicitly setting the encoding", That's what I mean when I say: "should I try a file instead, using this? `-*- coding: utf-8 -*-`", sounds like that would help, but I'm still curious about the terminal/shell issue... — Nate Anderson, Mar 03 '23 at 20:11
Probably the old [IBM code page 437](https://en.wikipedia.org/wiki/Code_page_437) from the DOS days. — dan04, Mar 03 '23 at 20:11
@NateAnderson definitely try it and report back the results! Also, ignore what I said about the default encoding being utf-8 for the source code, that's definitely true in Python 3, not sure about Python 2. So just set it and let's see — juanpa.arrivillaga, Mar 03 '23 at 20:12
Thank you both, thanks @dan04 for suggesting **IBM code page 437** -- I realize this question is very narrow (unlikely to be helpful). Also I should clarify it's not just Git Bash, it's a VS Code terminal running Git Bash.... for some reason running Git Bash itself [freezes up when I try to run C:/Python27/python.exe directly](https://stackoverflow.com/a/36530750/1175496). OK juanpa.arrivillaga I'll try using a file instead of the interpreter. — Nate Anderson, Mar 03 '23 at 20:15
IDLE is a GUI and defaults to the default ANSI code page (Windows-1252 for US and Western European Windows). The command prompt uses the default OEM code page (cp437 for US Windows and typically cp850 for Western European Windows). Windows-1252 encodes á as E1 and the OEM code pages use A0. — Mark Tolonen, Mar 03 '23 at 20:17
Thanks everyone. If someone wants to post the answer to get credit I will accept it. Also I will upvote your comments later (I reached my comment upvote limit today). Otherwise seems like this was just my confusion about an (esoteric?) encoding. I guess one hint is that I was apparently using an encoding where á and à are *not adjacent* (their byte values are quite different! '\xa0' vs '\x85' are far apart, vs '\xe1' and '\xe0' ). And another hint is knowledge about the command prompt/ OEM Codepage, like @MarkTolonen suggested — Nate Anderson, Mar 03 '23 at 20:24

Nate Anderson · Accepted Answer · 2023-03-05T19:20:00.083

Thanks @dan04, @MarkTolonen and @ (see the comments to the question above). As @MarkTolonen says:

"command prompt uses the default OEM code page (cp437 for US Windows ....)"

This seems clear from checking code page 437 for the values I'm trying to encode:

>>> 'á' #-> '\xa0' expected in code page 437
>>> 'à' #-> '\x85' expected in code page 437

I highlight those values in the screenshot below. $screenshot of code page 437 from https://en.wikipedia.org/wiki/Code_page_437 highlighting the characters à (mapping to byte \x85) and á (mapping to byte \xa0)$
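The same lookup can be cross-checked without the table by encoding the two characters with each codec mentioned in this thread. This is only a minimal sketch: the `CODECS` list and the `encoded_forms` helper are names made up for illustration, and the characters are written as escapes so the result does not depend on how the file or terminal is encoded.

```python
# Encode á and à with the codecs discussed in this thread and compare the bytes.
# CODECS and encoded_forms are illustrative names, not part of any library.
CODECS = ["cp437", "cp850", "cp1252", "latin-1", "utf-8"]


def encoded_forms(ch):
    """Return a dict mapping codec name -> encoded bytes for one character."""
    return {name: ch.encode(name) for name in CODECS}


for ch in (u"\xe1", u"\xe0"):  # U+00E1 is á, U+00E0 is à
    for name in CODECS:
        # Print the code point instead of the character itself, so the
        # output is plain ASCII whatever the terminal encoding is.
        print("U+%04X %-8s %r" % (ord(ch), name, encoded_forms(ch)[name]))
```

The cp437 rows should match the bytes from my Git Bash session, and the latin-1 rows should match what IDLE showed.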

I used @MarkTolonen's suggestion of running the chcp command to get or set the encoding used by my shell/terminal. chcp is short for "change code page". If you're using Git Bash, use chcp.com instead. Sure enough, when I run chcp, the output is Active code page: 437:

$a screenshot of two terminals/shells. on the left, git bash, with the command chcp, which returns "bash chcp: command not found". Then the command chcp.com, which returns "Active code page: 437". On the right, cmd, (Windows command line), with the command chcp, which returns "Active code page: 437". Then the command where chcp, which returns "C:\Windows\System32\chcp.com"$

Then I tried @juanpa.arrivillaga's suggestion of using a file. First I tried a file that explicitly used the 437 code page.

I added the "magic comment" to specify encoding 437: # -*- coding: cp437 -*-, but that's not enough to encode the file. The magic comment explains to Python how to decode the file.
I also had to change the encoding of the file (tell my editor, VS Code, how to encode in CP437).

Once I do both those things with a Python file (encode and decode with CP437), I get the same "unexpected" results as my OP, which confirms that CP437 is indeed the encoding used by my terminal/shell.

In general you must both encode and include the "decode magic comment", and make sure your shell uses the same encoding!

If I include the cp437 "magic comment" without encoding in CP437 (VS Code default encoding is UTF-8), the length of 'á' is 2; as in UTF-8! (Note the results are printed in my CP437 shell so they look strange; I see character ├ , which is \xc3 in CP437!)
If I encode in CP437 but I don't include the magic comment, I get an error: (SyntaxError: Non-ASCII character '\xa0' in file 437_encoding.py on line 4)
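The first mismatch above (UTF-8 bytes shown by a CP437 shell) can be reproduced in isolation. This is a rough sketch of the byte-level effect only, not of how Python 2 parses source files; the variable names are illustrative.

```python
# What an editor saving as UTF-8 writes to disk for the character á (U+00E1).
utf8_bytes = u"\xe1".encode("utf-8")
assert utf8_bytes == b"\xc3\xa1"
assert len(utf8_bytes) == 2  # the "length 2" seen when the file is really UTF-8

# What a CP437 terminal makes of those same two bytes.
as_cp437 = utf8_bytes.decode("cp437")
print([hex(ord(c)) for c in as_cp437])  # ['0x251c', '0xed']

# When the encoding and the decoding agree, the character survives the round trip.
assert u"\xe1".encode("cp437").decode("cp437") == u"\xe1"
```

U+251C is the box-drawing character ├ and U+00ED is í, which is the strange-looking output I saw in the shell.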

If I encode in utf-8, and I include the "magic comment" for utf-8, and I change my shell to use utf-8 (chcp.com 65001), then I get the results I expect!

Finally, if I try @MarkTolonen's suggestion to check sys.stdout.encoding, it reports 'cp437'!

Please note sys.stdout.encoding (which for me had the value cp437)...
is not the same as sys.getdefaultencoding() (which for me had the value ascii)...
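To see the two settings side by side, something like the following could be run in each environment (Git Bash, cmd, IDLE). It is a small sketch; `describe_encodings` is a hypothetical helper name, and `getattr` is used because, as noted in the comments, some IDEs replace the streams with objects that have no `encoding` attribute.

```python
import sys


def describe_encodings():
    """Collect the interpreter-wide default and the per-stream encodings."""
    return {
        # Used for implicit str/unicode conversions; not tied to the terminal.
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        # Stream encodings: these follow the terminal / code page when attached.
        "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
        "sys.stdin.encoding": getattr(sys.stdin, "encoding", None),
    }


for name, value in sorted(describe_encodings().items()):
    print("%-26s %r" % (name, value))
```

The values will differ by Python version and by how the process is launched, so no particular output should be expected here.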

And if I try to check sys.stdout.encoding when I used chcp.com to change the code page to UTF-8 (value 65001), I get an error LookupError: unknown encoding: cp65001 which is described in more detail here
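My understanding is that this error comes from the interpreter not having a codec registered under that name. A quick way to check which names a given interpreter knows is `codecs.lookup`; the sketch below wraps it in a hypothetical `codec_known` helper. The result for `cp65001` depends on the Python version and platform, so treat the output as something to observe rather than something I can promise.

```python
import codecs


def codec_known(name):
    """Return True if this interpreter can find a codec called `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


for name in ("cp437", "cp1252", "utf-8", "cp65001"):
    print("%-8s %s" % (name, codec_known(name)))
```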

FYI you can run `chcp` in the command prompt to see the code page used, and can change it as well, e.g. `chcp 1252` — Mark Tolonen, Mar 03 '23 at 20:27
In Python `import sys; print(sys.stdout.encoding)` should work in both command prompt and idle. It may not work in all IDEs as some redirect the output to custom objects that don't support the `encoding` attribute. — Mark Tolonen, Mar 03 '23 at 20:31
Thanks @MarkTolonen , you're right 1) running `chcp` is helpful to get or set the shell encoding and 2) running `sys.stdout.encoding` is helpful to get the shell encoding (vs what I was using, `sys.getdefaultencoding()` — Nate Anderson, Mar 05 '23 at 20:40
