
While handling a Unicode problem, I found that unicode(self) and self.__unicode__() behave differently:

# -*- coding: utf-8 -*-
import dis

class test():
    def __unicode__(self):
        s = u'中文'
        return s.encode('utf-8')  # note: returns a utf-8 byte string, not unicode

    def __str__(self):
        return self.__unicode__()  # calls the method directly, so no implicit conversion

print dis.dis(test)
a = test()
print a

The above code works okay, but if I change self.__unicode__() to unicode(self), it raises this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The problematic code is:

# -*- coding: utf-8 -*-
import dis

class test():
    def __unicode__(self):
        s = u'中文'
        return s.encode('utf-8')  # still returns bytes

    def __str__(self):
        return unicode(self)  # unicode() gets bytes back and tries to decode them as ascii

print dis.dis(test)
a = test()
print a

Very curious about how Python handles this, I tried the dis module but didn't see much difference:

Disassembly of __str__:
 12           0 LOAD_FAST                0 (self)
              3 LOAD_ATTR                0 (__unicode__)
              6 CALL_FUNCTION            0
              9 RETURN_VALUE   

vs.

Disassembly of __str__:
 10           0 LOAD_GLOBAL              0 (unicode)
              3 LOAD_FAST                0 (self)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE       
– springrider

4 Answers


You return bytes from your __unicode__ method.

To make it clear:

In [18]: class Test(object):
    def __unicode__(self):
        return u'äö↓'.encode('utf-8')
    def __str__(self):
        return unicode(self)
   ....:     

In [19]: class Test2(object):
    def __unicode__(self):
        return u'äö↓'
    def __str__(self):
        return unicode(self)
   ....:     

In [20]: t = Test()

In [21]: t.__str__()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/home/dav1d/<ipython-input-21-e2650f29e6ea> in <module>()
----> 1 t.__str__()

/home/dav1d/<ipython-input-18-8bc639cbc442> in __str__(self)
      3         return u'äö↓'.encode('utf-8')
      4     def __str__(self):
----> 5         return unicode(self)
      6 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

In [22]: unicode(t)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/home/dav1d/<ipython-input-22-716c041af66e> in <module>()
----> 1 unicode(t)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

In [23]: t2 = Test2()

In [24]: t2.__str__()
Out[24]: u'\xe4\xf6\u2193'

In [25]: str(_) # _ = last result
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/home/dav1d/<ipython-input-25-3a1a0b74e31d> in <module>()
----> 1 str(_) # _ = last result

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

In [26]: unicode(t2)
Out[26]: u'\xe4\xf6\u2193'

In [27]: class Test3(object):
    def __unicode__(self):
        return u'äö↓'
    def __str__(self):
        return unicode(self).encode('utf-8')
   ....:     

In [28]: t3 = Test3()

In [29]: t3.__unicode__()
Out[29]: u'\xe4\xf6\u2193'

In [30]: t3.__str__()
Out[30]: '\xc3\xa4\xc3\xb6\xe2\x86\x93'

In [31]: print t3
äö↓

In [32]: print unicode(t3)
äö↓

print a, or in my case print t, will call t.__str__(), which is expected to return bytes. You let it return unicode, so Python tries to encode it with ascii, which will not work.

Easy fix: let __unicode__ return unicode and __str__ return bytes.

– dav1d

s = u'中文'
return s.encode('utf-8')

This returns a non-Unicode, byte string. That's what encode is doing. utf-8 is not a thing that magically turns data into Unicode; if anything, it's the opposite - a way of representing Unicode (an abstraction) in bytes (data, more or less).
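You can see this directly in a Python 2 interpreter:

>>> type(u'中文'.encode('utf-8'))  # encode produces a byte string
<type 'str'>
>>> type(u'中文')
<type 'unicode'>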

We need a bit of terminology here. To encode is to take a Unicode string and make a byte string that represents it, using some kind of encoding. To decode is the reverse: to take a byte string (that we think encodes a Unicode string) and interpret it as a Unicode string, using a specified encoding.

When we encode to a byte string and then decode using the same encoding, we get the original Unicode back.

utf-8 is one possible encoding. There are many, many more.
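For example, a minimal Python 2 round trip (assuming a source file declared as utf-8, like the question's):

u = u'中文'                     # a unicode string (the abstraction)
b = u.encode('utf-8')          # encode: unicode -> bytes, '\xe4\xb8\xad\xe6\x96\x87'
assert b.decode('utf-8') == u  # decoding with the same codec recovers the original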

Sometimes Python will report a UnicodeDecodeError when you call encode. Why? Because you try to encode a byte string. The proper input for this process is a Unicode string, so Python "helpfully" tries to decode the byte string to Unicode first. But it doesn't know what codec to use, so it assumes ascii. This codec is the safest choice, in an environment where you could receive all kinds of data. It simply reports an error for bytes >= 128, which are handled in a gazillion different ways in various 8-bit encodings. (Remember trying to import a Word file with letters like é from a Mac to a PC or vice-versa, way back in the day? You'd get some other weird symbol on the other computer, because the platform built-in encoding was different.)
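A minimal sketch of that trap, reusing the question's string:

b = u'中文'.encode('utf-8')  # a byte string
b.encode('utf-8')            # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 ...
                             # because Python first runs b.decode('ascii') behind the scenes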

Making things even more complicated, in Python 2 the encode/decode mechanism is also used to implement some other neat things that have nothing to do with interpreting Unicode. For example, there is a Base64 encoder, and a thing that automatically handles string escape sequences (i.e. it will change a backslash, followed by a letter 't', into a tab). Some of these do "encode" or "decode" from a byte string to a byte string, or from Unicode to Unicode.
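A couple of Python 2-only examples (these codecs are gone from str in Python 3):

>>> 'hello'.encode('base64')          # bytes -> bytes
'aGVsbG8=\n'
>>> 'a\\tb'.decode('string_escape')   # backslash-t becomes a real tab
'a\tb'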

(By the way, this all works completely differently - much more clearly, IMHO - in Python 3.)

Similarly, when __unicode__ returns a byte string (which it should not, as a matter of style), the Python unicode() built-in function automatically decodes it as ascii; and when __str__ returns a Unicode string (which again it should not), str() will encode it as ascii. This happens behind the scenes, in code you cannot control. However, you can fix __unicode__ and __str__ to do what they are supposed to do.

(You can, in fact, override the encoding for unicode, by passing a second parameter. However, this is the wrong solution here since you should already have a Unicode string returned from __unicode__. And str doesn't take an encoding parameter, so you're out of luck there.)
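For illustration, with the question's data:

>>> unicode('\xe4\xb8\xad\xe6\x96\x87', 'utf-8')  # the utf-8 bytes of u'中文'
u'\u4e2d\u6587'
>>> str(u'中文')  # no encoding parameter exists, so the implicit ascii encode fails
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)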

So, now we can solve the problem.

Problem: We want __unicode__ to return the Unicode string u'中文', and we want __str__ to return the utf-8-encoded version of that.

Solution: return that string directly in __unicode__, and do the encoding explicitly in __str__:

class test():
    def __unicode__(self):
        return u'中文'  # unicode out

    def __str__(self):
        return unicode(self).encode('utf-8')  # encode explicitly, bytes out
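
A quick check (assuming a terminal that displays utf-8):

a = test()
print a           # __str__: utf-8 bytes -> 中文
print unicode(a)  # __unicode__: unicode, encoded for the terminal on output -> 中文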
– Karl Knechtel

  • Great explanation. Another way of putting it: Unicode is not an encoding. – Daniel Roseman Jun 20 '12 at 10:40
  • See also: http://stackoverflow.com/questions/447107/whats-the-difference-between-encode-decode-python-2-x/448383#448383 – codeape Jun 20 '12 at 11:01
  • thanks a lot for the explanation! I may have misunderstood how the overriding works. I always thought that calling str(obj) is the same as calling obj.__str__(), and that unicode(obj) is the same as obj.__unicode__(). But I guess with str(obj) and unicode(obj), the built-in function does some other stuff first, then passes control to the overriding method such as `__str__` or `__unicode__`, is that right? – springrider Jun 20 '12 at 14:46

When you call unicode() on a Python object, you get the unicode representation of the argument you pass in.

Since you haven't specified what encoding should be used, you get an error that the argument can't be represented using only ASCII.

When you call __unicode__ directly, your method explicitly says that utf-8 should be used to encode that string, which is correct and runs with no problems.

You can pass the desired encoding as a second parameter to unicode(), such as:

unicode(byte_string, "utf-8")

And that should work the same way that your __unicode__ method does.
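A small sketch of that, reusing the byte string the question's __unicode__ produces (assuming a utf-8 terminal):

raw = u'中文'.encode('utf-8')  # the bytes the question's __unicode__ returns
print unicode(raw, 'utf-8')    # decoded explicitly instead of with ascii -> 中文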

– pcalcao

When you defined the __unicode__ special method, you told it what encoding to use. When you simply call unicode(), you did not specify the encoding, so Python used the default, "ascii".
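
You can check that default yourself:

import sys
print sys.getdefaultencoding()  # normally 'ascii' on Python 2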

BTW, __str__ should return a string of bytes, not unicode, and __unicode__ should return unicode, not a byte string. So this code is backwards. Since __unicode__ is not returning unicode, Python tries to convert its result using the default encoding.

– Keith