I've observed the following:

>>> print '£' + '1'
£1
>>> print '£' + u'1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> print u'£' + u'1'
£1
>>> print u'£' + '1'
£1

Why does '£' + '1' work but '£' + u'1' doesn't work?

I looked at the types:

>>> type('£' + '1')
<type 'str'>
>>> type('£' + u'1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> type(u'£' + u'1')
<type 'unicode'>

This also confuses me. If '£' + '1' is a str and not a unicode, why does it print properly on my terminal? Shouldn't it print something like '\xc2\xa31'?

To add to the mix, I've also observed the following:

>>> u'£' + '1'
u'\xa31'
>>> type('1')
<type 'str'>
>>> type(u'£')
<type 'unicode'>
>>> print u'£' + '1'
£1

Why does u'£' + '1' not print out the £ symbol properly, whereas print u'£' + '1' does? Is it because repr is used in the former, whereas str is used in the latter?

Also, how come concatenating a unicode and a str works in this case, but not in the '£' + u'1' case?

texasflood
  • Afaik you can only concat strings of the same type, i.e. `u'£'+u'1'` or `'£'+'1'`. You cannot mix them. – Bjorn Aug 02 '15 at 12:14
  • You are trying to decode as ascii with `print '£' + u'1'`, you are never going to see `'\xc2\xa31'` when you print unless you print the `repr` of the object, `print '£' + '1'` works because your shell is configured to accept utf-8 – Padraic Cunningham Aug 02 '15 at 12:20
  • @Bjorn You can, I've done it many times, see the updated question – texasflood Aug 02 '15 at 12:23
  • Boy, you are really hitting all the duplicates. I should just have closed this as one. – Martijn Pieters Aug 02 '15 at 12:29
  • @MartijnPieters Sorry about that, could you point me to them? – texasflood Aug 02 '15 at 12:42
  • @texasflood: a search through my `unicode` answers turns up [Python ascii utf unicode](https://stackoverflow.com/q/27256006), [Python str vs unicode types](https://stackoverflow.com/q/18034272) and [How are these strings represented internally in Python interpreter ? I don't understand](https://stackoverflow.com/q/14839028), [Two apparently equal Python Unicode UTF8-encoded strings don't match](https://stackoverflow.com/q/17343307), – Martijn Pieters Aug 02 '15 at 12:58

1 Answer

You are mixing object types.

'£' is a bytestring, containing encoded data. That those bytes happen to represent a pound sign in your terminal or console is neither here nor there; they could just as well have been a pixel in an image. Your terminal or console is configured to produce and accept UTF-8 data, so the actual content of that bytestring is the two bytes C2 and A3, when expressed in hexadecimal.

u'1' on the other hand is a Unicode string. It is unambiguously text data. If you want to concatenate other data to it, it too should be Unicode. Python 2 then will automatically decode str bytes to Unicode using the default ASCII codec if you try to do this.

However, the '£' bytestring is not decodable as ASCII. It can be decoded as UTF-8; decode the bytes explicitly, since we know the correct codec here:

print '£'.decode('utf8') + u'1'
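
The implicit step that fails can also be reproduced by hand. What follows is only a minimal sketch, assuming the terminal sent UTF-8 so that '£' arrived as the bytes C2 A3. The variable names are illustrative, and the bytes are spelled out as escapes so the snippet does not depend on the source encoding:

```python
# Bytes that a UTF-8 terminal sends for the pound sign (assumption).
pound_bytes = b'\xc2\xa3'

# Roughly what Python 2 attempts for pound_bytes + u'1':
# an implicit decode with the default ASCII codec.
try:
    pound_bytes.decode('ascii')
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False

# Decoding explicitly with the matching codec succeeds.
combined = pound_bytes.decode('utf-8') + u'1'

# A pure-ASCII bytestring such as '1' survives the ASCII decode,
# which is why a unicode pound sign plus '1' concatenates fine.
one = b'1'.decode('ascii')
```

Here implicit_ok ends up False, while combined is the unicode value u'\xa31'.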

When writing bytes to the terminal or console, it is your terminal or console that interprets the bytes and makes sense of them. If you write a unicode object to the terminal, the sys.stdout object takes care of encoding, converting the text to bytes your terminal or console will understand.

The same applies to taking input; the sys.stdin stream produces bytes, which Python can decode transparently when you use the u'£' syntax to create a Unicode object. You type the character on your keyboard, it is translated to UTF-8 bytes by the terminal or console, and written to Python to interpret.

That writing '\xc2\xa3' with print works, then, is a happy coincidence. You could take the unicode object, encode it with a different codec, and end up with garbage output:

>>> print u'£1'.encode('latin-1')
?1

My Mac terminal converted the data written for the £ sign to a ?, because the A3 byte (the Latin-1 codepoint for the pound sign) doesn't map to anything when interpreted as UTF-8.
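
The mismatch can be checked without a terminal by decoding the Latin-1 bytes as UTF-8 directly. This is a sketch under the assumption that the terminal behaves like a UTF-8 decoder that substitutes a placeholder for invalid bytes; the names are illustrative:

```python
text = u'\xa31'  # pound sign followed by the digit 1

latin1_bytes = text.encode('latin-1')  # one byte (A3) for the pound sign
utf8_bytes = text.encode('utf-8')      # two bytes (C2 A3) for the pound sign

# A lone A3 byte is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# With the 'replace' error handler the bad byte becomes U+FFFD,
# comparable to the ? a terminal may draw.
shown = latin1_bytes.decode('utf-8', 'replace')
```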

Python determines the terminal or console codec from the locale.getpreferredencoding() function. You can observe which codec your terminal or console reported via the sys.stdout.encoding and sys.stdin.encoding attributes:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
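
What print does with a unicode object can be approximated by encoding it yourself with that codec. This is a rough sketch, not the interpreter's actual code path: the fallback to locale.getpreferredencoding() and the 'replace' error handler are assumptions made so the snippet also runs when stdout is not a terminal:

```python
import locale
import sys

preferred = locale.getpreferredencoding()

# sys.stdout.encoding can be None or missing when output is redirected.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

text = u'\xa31'
# Encode the text the way it would have to be encoded for this stream.
encoded = text.encode(stdout_encoding or preferred, 'replace')
```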

Last but not least, you should not confuse printing with the representations echoed by the interpreter in interactive mode. The interpreter shows the outcome of expressions using the repr() function, a debugging tool that tries to produce Python literal notation wherever possible, using only ASCII characters. For Unicode values, that means any non-printable, non-ASCII character is reflected using escape sequences. This makes the value suitable for copying and pasting without requiring more than an ASCII-capable medium.

The repr() result of a str uses \n for newlines, for example, and \xhh hex escapes for bytes outside the printable range that have no dedicated escape sequence. In addition, for unicode objects, codepoints outside the Latin-1 range are represented with \uhhhh and \Uhhhhhhhh escape sequences, depending on whether or not they are part of the Basic Multilingual Plane:

>>> u'''\
... A multiline string to show newlines
... can contain £ latin characters
... or emoji 💩!
... '''
u'A multiline string to show newlines\ncan contain \xa3 latin characters\nor emoji \U0001f4a9!\n'
>>> print _
A multiline string to show newlines
can contain £ latin characters
or emoji 💩!
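
The same escape forms can be produced outside the interactive prompt. The sketch below uses the unicode_escape codec as a stand-in; it is not how repr() is implemented, it just yields the same \xhh, \Uhhhhhhhh and \n escapes for this sample:

```python
# Pound sign, a space, U+1F4A9 and a newline, written as escapes
# so the source itself stays pure ASCII.
text = u'\xa3 \U0001f4a9\n'

# ASCII-only bytes containing the escape sequences.
escaped = text.encode('unicode_escape')

# Decoding with the same codec restores the original text.
restored = escaped.decode('unicode_escape')
```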
Martijn Pieters
  • OK thank you. So `u'£' + '1'` works because the `'1'` can be decoded as UTF-8? – texasflood Aug 02 '15 at 12:34
  • But then `u'£' + '1'` returns a unicode object, so how does it combine the ASCII and UTF-8 objects? I would have thought that it would turn `'1'` into its UTF-8 equivalent, then concatenated two UTF-8 objects, which is trivial – texasflood Aug 02 '15 at 12:41
  • There is no UTF-8 object. You have a *Unicode* object. UTF-8 is a codec, a way to encode Unicode codepoints to bytes, it is not the same thing as the Unicode data itself, just like using ISO 8601 notation to write down a date and a time is not the same thing as the timestamp itself. – Martijn Pieters Aug 02 '15 at 12:42
  • @texasflood: That Python had to decode from UTF-8 to produce the `unicode` object is neither here nor there. The `'1'` is implicitly decoded from ASCII because you tried to concatenate it with a `unicode` object. – Martijn Pieters Aug 02 '15 at 12:43