You are mixing object types.
'£'
is a bytestring, containing encoded data. That those bytes happen to represent a pound sign in your terminal or console is neither here nor there; they could just as well have been a pixel in an image. Your terminal or console is configured to produce and accept UTF-8 data, so the actual content of that bytestring is the two bytes C2 and A3, when expressed in hexadecimal.
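As a small sketch of the point above, the following encodes the pound sign to UTF-8 and inspects the resulting bytes. The variable names are illustrative only, and the snippet uses escape sequences and explicit `u''` / `b''` prefixes so that it does not depend on the source file or terminal encoding and runs on both Python 2 and 3:

```python
from __future__ import print_function

import binascii

pound_text = u'\xa3'                      # Unicode text: one code point, U+00A3
pound_bytes = pound_text.encode('utf-8')  # the bytes a UTF-8 terminal produces for it

# hexlify returns ASCII bytes; decode so it prints the same on Python 2 and 3
hex_form = binascii.hexlify(pound_bytes).decode('ascii')

print(len(pound_text), len(pound_bytes), hex_form)  # 1 2 c2a3
```

One character of text becomes two bytes of data, which is why the two types should not be treated as interchangeable.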
u'1'
on the other hand is a Unicode string. It is unambiguously text data. If you want to concatenate other data to it, that data should be Unicode too. Python 2 will otherwise automatically decode str
bytes to Unicode using the default ASCII codec if you try to do this.
However, the '£'
bytestring is not decodable as ASCII. It can be decoded as UTF-8; decode the bytes explicitly, since we know the correct codec here:
print '£'.decode('utf8') + u'1'
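To make the failure mode concrete, here is a minimal sketch that performs by hand the ASCII decode Python 2 attempts implicitly, then the explicit UTF-8 decode that succeeds. The names `pound_bytes`, `ascii_ok` and `price` are hypothetical. It is written with explicit prefixes so it also runs on Python 3, where the implicit decode does not exist and mixing `bytes` and `str` raises a `TypeError` instead:

```python
from __future__ import print_function

pound_bytes = b'\xc2\xa3'  # the UTF-8 encoded pound sign

# What Python 2 does behind the scenes when concatenating str and unicode
try:
    pound_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    print('ASCII decode failed:', exc.reason)

# Decoding explicitly with the codec we know to be correct
price = pound_bytes.decode('utf-8') + u'1'
print(price == u'\xa31')  # True
```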
When writing bytes to the terminal or console, it is your terminal or console that interprets the bytes and makes sense of them. If you write a unicode
object to the terminal, the sys.stdout
object takes care of encoding, converting the text to bytes your terminal or console will understand.
The same applies to taking input; the sys.stdin
stream produces bytes, which Python can decode transparently when you use the u'£'
syntax to create a Unicode object. You type the character on your keyboard, it is translated to UTF-8 bytes by the terminal or console, and written to Python to interpret.
That writing '\xc2\xa3'
with print
works, then, is a happy coincidence. You could take the unicode
object, encode it to a different codec, and end up with garbage output:
>>> print u'£1'.encode('latin-1')
?1
My Mac terminal converted the data written for the £
sign to a ?
, because the A3 byte (the Latin-1 codepoint for the pound sign) doesn't map to anything when interpreted as UTF-8.
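The mismatch can be reproduced without a terminal. In this sketch, decoding with `errors='replace'` stands in for what the terminal does with bytes it cannot interpret; that is an assumption about terminal behaviour, and the exact placeholder character a real terminal shows may differ:

```python
from __future__ import print_function

latin1_bytes = u'\xa31'.encode('latin-1')  # pound sign + '1' as Latin-1: b'\xa31'

# A lone A3 byte is not a valid UTF-8 sequence
try:
    latin1_bytes.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Substitute U+FFFD REPLACEMENT CHARACTER for the undecodable byte
shown = latin1_bytes.decode('utf-8', 'replace')

print(valid_utf8, shown == u'\ufffd1')  # False True
```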
Python determines the terminal or console codec from the locale.getpreferredencoding()
function; you can observe which codec your terminal or console reported via the sys.stdout.encoding
and sys.stdin.encoding
attributes:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
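A short script version of the same inspection might look like the sketch below. The values depend entirely on your environment, so no output is shown; `getattr` is used defensively because a replaced stream may lack an `encoding` attribute, and on Python 2 the attribute can be `None` when output is piped:

```python
from __future__ import print_function

import locale
import sys

preferred = locale.getpreferredencoding()             # codec derived from the locale
stdout_encoding = getattr(sys.stdout, 'encoding', None)
stdin_encoding = getattr(sys.stdin, 'encoding', None)

print('locale preferred:', preferred)
print('stdout encoding: ', stdout_encoding)
print('stdin encoding:  ', stdin_encoding)
```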
Last but not least, you should not confuse printing with the representations echoed by the interpreter in interactive mode. The interpreter shows the outcome of expressions using the repr()
function, a debugging tool that tries to produce Python literal notation wherever possible, using only ASCII characters. For Unicode values, that means any non-printable or non-ASCII character is shown using escape sequences. This makes the value suitable for copying and pasting without requiring more than an ASCII-capable medium.
The repr()
result of a str
uses \n
for newlines, for example, and \xhh
hex escapes for bytes outside the printable range that have no dedicated escape sequence. In addition, for unicode
objects, codepoints outside the Latin-1 range are represented with \uhhhh
and \Uhhhhhhhh
escape sequences, depending on whether or not they are part of the Basic Multilingual Plane:
>>> u'''\
... A multiline string to show newlines
... can contain £ latin characters
... or emoji 💩!
... '''
u'A multiline string to show newlines\ncan contain \xa3 latin characters\nor emoji \U0001f4a9!\n'
>>> print _
A multiline string to show newlines
can contain £ latin characters
or emoji 💩!