
What I know is:

  1. # -*- coding: utf-8 -*-
    This declares the encoding of a Python source file. Once I set the encoding name, the Python parser interprets the file using that encoding. I call it "file encoding";

  2. from __future__ import unicode_literals
    I am working in Python 2.7, and I use from __future__ import unicode_literals to change the default type of string literals from "str" to "unicode". I call it "string encoding";

  3. sys.setdefaultencoding('utf8')
    Sometimes I get an error in Django. For example, after I stored Chinese text through the admin and then visited the related pages, I got:

    UnicodeEncodeError at /admin/blog/vulpaper/29/change/
    'ascii' codec can't encode characters in position 6-13: ordinal not in range(128)
    ... (more error information)
    The string that could not be encoded/decoded was: emcms外贸网站管理系统

    To work around this problem, I call sys.setdefaultencoding('utf8') in the Django settings file.

But actually, I don't understand the technical details of the above.

What confuses me is:
1. Since I have set the Python source file encoding, why should I also set the string encoding to make sure my strings use the encoding I want?
What's the difference between "file encoding" and "string encoding"?
2. Since I have set both the "file encoding" and the "string encoding", why does UnicodeEncodeError still happen?

Daniel Roseman
Alvin
  • Because the encoding of the python source code file, and how Python will encode strings itself are two different things. You can construct a python file in utf-8, but write ASCII strings. – Willem Van Onsem May 29 '18 at 08:21
  • Setting `sys.setdefaultencoding` is unlikely to be the right answer to anything. If you're getting that error, you probably have a mistake in your `__unicode__` method. – Daniel Roseman May 29 '18 at 08:27
  • @DanielRoseman Thanks for the reminder. I searched the related questions and found that almost no one recommends that newcomers use `sys.setdefaultencoding` in projects, so I removed it from my settings file. First I checked the type of the Chinese string I was about to store: it is unicode. Then I checked the model, and when I changed the instance method from `__str__` to `__unicode__`, it worked and the UnicodeEncodeError went away. – Alvin May 30 '18 at 02:14

1 Answer


The file encoding declaration and the unicode_literals import are often used together, but they control very different things, so it helps to know the difference.

File Encoding

If you write non-ASCII characters anywhere in your source code, including comments and string literals, you need to declare the file's encoding so that the Python parser can read it. In Python 2, a file that contains non-ASCII bytes but no declaration fails with a SyntaxError, and declaring the wrong encoding can cause the same error or silently garbled literals. PEP 263 explains the problem in detail and how you can control the encoding used by the parser:

In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding "unicode-escape". This makes the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales such as many of the Asian countries.

...

Python will default to ASCII as standard encoding if no other encoding hints are given.
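
As a rough illustration of what the declaration controls, the sketch below decodes the same bytes with different codecs, which is approximately what the parser does with the bytes of a source file. The variable names are made up for this example, and the real parser reports a SyntaxError rather than the UnicodeDecodeError shown here, so treat it as an analogy and not as the parser's actual code path.

```python
# -*- coding: utf-8 -*-
# Analogy only: decoding bytes by hand to mimic how a declared encoding
# changes what the parser sees. All names here are illustrative.

# The bytes an editor writes to disk when this text is saved as UTF-8.
source_bytes = u"κόσμε".encode("utf-8")

as_utf8 = source_bytes.decode("utf-8")      # matching declaration: correct text
as_latin1 = source_bytes.decode("latin-1")  # wrong declaration: garbled text

# With no declaration, Python 2 assumes ASCII, which cannot decode these bytes.
try:
    source_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(len(source_bytes))     # 10 bytes on disk
print(len(as_utf8))          # 5 characters
print(len(as_latin1))        # 10 characters, one per byte
print(as_utf8 == as_latin1)  # False
print(ascii_ok)              # False
```

The bytes on disk never change; only their interpretation does, which is why the declaration has to match the encoding the editor actually used when saving the file.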

Unicode Literal Strings

Python 2 uses two different types for strings, unicode and str. By default, an unprefixed string literal creates an object of type str.

s = "A literal string"
print type(s)

<type 'str'>

TL;DR

If you want to change this behavior and create a unicode object for every unprefixed string literal, use from __future__ import unicode_literals.

If you want to understand why this is useful, keep reading.

You can explicitly define a unicode literal using the u prefix. The interpreter then creates a unicode object for this literal instead.

s = u"A literal string"
print type(s)

<type 'unicode'>
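
The effect of from __future__ import unicode_literals mentioned in the TL;DR can be sketched in the same way. The snippet below is written to run on both Python 2.7 and Python 3 (where the import has no effect, since literals are already text), and the names text_type, plain and prefixed are only illustrative.

```python
from __future__ import unicode_literals

# type(u"") is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")

plain = "A literal string"      # unprefixed: a text object because of the import
prefixed = b"A literal string"  # b prefix: always a byte string

print(isinstance(plain, text_type))     # True
print(isinstance(prefixed, bytes))      # True
print(isinstance(prefixed, text_type))  # False
```

Note that the import only affects literals in the module where it appears; strings that arrive from files, sockets or other modules keep whatever type they already have.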

For ASCII text, the str type is sufficient, but if you intend to manipulate non-ASCII text, it is important to use the unicode type so that character-level operations work correctly. The following example shows how str and unicode interpret exactly the same literal differently at the character level.

# -*- coding: utf-8 -*-

def print_characters(s):
    print "String of type {}".format(type(s))
    print "  Length: {} ".format(len(s))
    print "  Characters: " ,
    for c in s:
        print c,
    print
    print


u_lit = u"Γειά σου κόσμε"
s_lit = "Γειά σου κόσμε"

print_characters(u_lit)
print_characters(s_lit)

Output:

String of type <type 'unicode'>
  Length: 14 
  Characters:  Γ ε ι ά   σ ο υ   κ ό σ μ ε

String of type <type 'str'>
  Length: 26 
  Characters:  � � � � � � � �   � � � � � �   � � � � � � � � � �

With str, len() reported 26 because a Python 2 str holds bytes, and this text occupies 26 bytes in UTF-8; iterating yielded individual bytes, which print as garbage. With unicode, both the length (14) and the iteration operate on characters, as expected.
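
The relationship between the two lengths can be checked directly. This is a minimal sketch, assuming the source file is saved as UTF-8; the names text, raw and decoded are illustrative, and the code runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u"Γειά σου κόσμε"       # text object: one item per character
raw = text.encode("utf-8")     # the bytes a Python 2 str literal would hold

print(len(text))  # 14
print(len(raw))   # 26: 12 Greek letters at 2 bytes each, plus 2 spaces

# Decoding the bytes with the right codec recovers the characters.
decoded = raw.decode("utf-8")
print(decoded == text)  # True
print(len(decoded))     # 14
```

So a str that already contains UTF-8 bytes is not lost: decoding it once, at the point where it enters the program, gives a unicode object that behaves correctly.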

Setting sys.setdefaultencoding('utf8')

There is a nice answer on Stack Overflow about why we shouldn't use it :)
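
For context on the error in the question: when Python 2 has to turn a unicode object into a str without being told which codec to use, it falls back to the default codec, ASCII. The sketch below makes that implicit step explicit and shows the usual alternative of encoding with a named codec. It is only an illustration with made-up variable names, not what Django does internally, and it runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
title = u"emcms外贸网站管理系统"   # the text from the question's traceback

# What Python 2 attempts implicitly when a unicode object must become a str.
try:
    title.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Naming the codec explicitly works and leaves the global default untouched.
utf8_bytes = title.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(ascii_failed)         # True
print(len(title))           # 13 characters
print(len(utf8_bytes))      # 29 bytes: 5 ASCII plus 8 characters at 3 bytes each
print(round_trip == title)  # True
```

This matches the fix reported in the comments: returning the text from `__unicode__` keeps it as unicode, so no implicit ASCII encoding is attempted.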

Kon Pal