
The code below will cause a UnicodeDecodeError:

# -*- coding: utf-8 -*-
s = "中文"
u = u"123"
u = s + u

I know it's because the Python interpreter is using ASCII to decode s.

Why doesn't the Python interpreter use the file encoding (UTF-8) for decoding?

WKPlus

3 Answers


The types of the two strings are different: the first is a byte string, the second is a unicode string, hence the error.

So, instead of doing s="中文", do the following to get unicode strings for both:

s=u"中文"
u=u"123"
u=s+u
Anshul Goyal

The code works perfectly fine on Python 3.

However, in Python 2, if you do not add a u before a string literal, you are constructing a string of bytes. When one wants to combine a string of bytes and a string of characters, one either has to decode the string of bytes, or encode the string of characters. Python 2.x opted for the former. In order to prevent accidents (for example, someone appending binary data to a user input and thus generating garbage), the Python developers chose ascii as the encoding for that conversion.

You can add a line

from __future__ import unicode_literals

after the #coding declaration so that literals without u or b prefixes are always character literals, not byte literals.

phihag

Implicit decoding cannot know what source encoding was used. That information is not stored with strings.

All that Python has after importing is a byte string with characters representing bytes in the range 0-255. You could have imported that string from another module, or read it from a file object, etc. The fact that the parser knew what encoding was used for those bytes doesn't even matter for plain byte strings.
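A short sketch of this point, in Python 3 syntax: the bytes themselves carry no record of the encoding that produced them, so decoding with the wrong codec can "succeed" and silently yield mojibake:

```python
# The same six bytes could have come from a literal, a file, or a socket;
# nothing in the object itself says "these are UTF-8".
data = "中文".encode("utf-8")
print(data)                    # b'\xe4\xb8\xad\xe6\x96\x87'

# Decoding with the right codec recovers the text.
print(data.decode("utf-8"))    # 中文

# A wrong codec like Latin-1 decodes without error, but to garbage.
print(data.decode("latin-1"))
```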

As such, it is always better to decode bytes explicitly rather than rely on implicit decoding. Either use a Unicode literal for s as well, or decode explicitly using str.decode():

u = s.decode('utf8') + u
Martijn Pieters
  • I knew the difference between unicode and string, and I knew how to solve the problem I encountered. I just wanted to know why the Python interpreter doesn't use the file encoding to decode a string to unicode. So I think this answer is best. – WKPlus Jul 26 '14 at 15:27