Background
I'm using a Windows machine. I know Python 2.* is not supported anymore, but I'm still learning Python 2.7.16. I also have Python 3.7.1. I know in Python 3.* "unicode
was renamed to str
"
I use Git Bash as my main shell.
I read this question. I feel like I understand the difference between Unicode (code points) and encodings (different encoding systems; bytes).
Question
- When I evaluate
'á'
, I expect to get'\xc3\xa1'
as shown in this answer - When I evaluate
len('á')
, I expect to get2
, as shown in this answer
But I don't get expected results. When running git bash C:\Python27\python.exe...:
Python 2.7.16 (v2.7.16:413a49145e, Mar 4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xa0'
#'\xc3\xa1' expected
>>> len('á')
1
#2 expected
# one more for reference:
>>> 'à'
'\x85'
#'\xc3\xa0' expected
Can you help me understand why I get the output shown above?
Specifically why does 'á'
become '\xa0'
?
What I tried
I can use unicode
object to get the results I expect:
>>> u'á'.encode('utf-8')
'\xc3\xa1'
>>> len(u'á'.encode('utf-8'))
2
I can open IDLE and I get different results -- not expected results, but at least I understand these results.
Python 2.7.16 (v2.7.16:413a49145e, Mar 4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xe1'
>>> len('á')
1
>>> 'à'
'\xe0'
The IDLE results are unexpected but I still understand the results; Martijn Peters explains why 'á'
become '\xe1'
in the Latin 1 encoding.
So why does IDLE give different results from running my Git Bash Python 2.7.1 executable directly? In other words, if IDLE is using Latin 1 to encoding for my input, what encoding is used by my Git Bash Python 2.7.1. executable, such that 'á'
becomes '\xa0'
What I'm wondering
Is my default encoding the problem? I'm too scared to change the default encoding.
>>> import sys; sys.getdefaultencoding()
'ascii'
I feel like it's my terminal's encoding that's the problem? (I use git bash) Should I try to change the PYTHONIOENCODING
environment variable?
I try to check the git bash locale
, the result is:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
Also I'm using interactive Python , should I try a file instead, using this?
# -*- coding: utf-8 -*- sets the source file's encoding, not the output encoding.
I know upgrading to Python 3 is a solution., but I'm still curious about why my Python 2.7.16 behaves differently.