
Basically, I just want to create instances of a class called Bottle (e.g. class Bottle(object): ...) and then, in another module, simply "print" any instance without having to hack in code that explicitly calls a character-encoding routine.

In summary, when I try:

obj=Bottle(u"味精")
print obj

Or do an "in place" print:

print Bottle(u"味精")

I get:

"UnicodeEncodeError: 'ascii' codec can't encode characters"

¢ It's currently not feasible to switch to python3. ¢

A solution or hint (with an explanation) on how to do an in-place UTF-8 print (just as class U does successfully below) would be much appreciated. :-)

ThanX N

--

Sample code:

-------- 8>< - - - - cut here - - - -

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def setdefaultencoding(encoding="utf-8"):
  import sys, codecs

  org_encoding = sys.getdefaultencoding()
  if org_encoding == "ascii": # not good enough
    print "encoding set to "+encoding
    sys.stdout = codecs.getwriter(encoding)(sys.stdout)
    sys.stderr = codecs.getwriter(encoding)(sys.stderr)

setdefaultencoding()

msg=u"味精" # the message!

class U(unicode): pass

m1=U(msg)

print "A)", m1 # works fine, even with unicode, but

class Bottle(object):
  def __init__(self,msg): self.msg=msg
  def __repr__(self): 
    print "debug: __repr__",self.msg
    return '{{{'+self.msg+'}}}'
  def __unicode__(self): 
    print "debug: __unicode__",self.msg
    return '{{{'+self.msg+'}}}'
  def __str__(self): 
    print "debug: __str__",self.msg
    return '{{{'+self.msg+'}}}'
  def decode(self,arg): print "debug: decode",self.msg
  def encode(self,arg): print "debug: encode",self.msg
  def translate(self,arg): print "debug: translate",self.msg

m2=Bottle(msg)

#print "B)", str(m2)
print "C) repr(x):", repr(m2)
print "D) unicode(x):", unicode(m2)
print "E)",m2 # gives:  UnicodeEncodeError: 'ascii' codec can't encode characters

-------- 8>< - - - - cut here - - - -

Python 2.4 output:

encoding set to utf-8
A) 味精
C) repr(x): debug: __repr__ 味精
{{{\u5473\u7cbe}}}
D) unicode(x): debug: __unicode__ 味精
{{{味精}}}
E) debug: __str__ 味精
Traceback (most recent call last):
  File "./uc.py", line 43, in ?
    print "E)",m2 # gives:  UnicodeEncodeError: 'ascii' codec can't encode characters
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-4: ordinal not in range(128)

-------- 8>< - - - - cut here - - - -

Python 2.6 output:

encoding set to utf-8
A) 味精
C) repr(x): debug: __repr__ 味精
Traceback (most recent call last):
  File "./uc.py", line 41, in <module>
    print "C) repr(x):", repr(m2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-4: ordinal not in range(128)
NevilleDNZ
  • the output doesn't reflect the code e.g., `repr(x)` should produce `UnicodeEncodeError` too – jfs Nov 22 '11 at 05:29
  • @J.F. Sebastian : Python 2.4 does produce the above, so I ran the code on 2.6 and repr now also produces error message. – NevilleDNZ Nov 22 '11 at 05:39
  • The message "UnicodeEncodeError: 'ascii' codec can't encode characters" makes me suspect that "print" does not use "sys.stdout" as I changed this file's codec/encoding to "utf-8" with "sys.stdout = codecs.getwriter(encoding)(sys.stdout)" – NevilleDNZ Nov 22 '11 at 05:51
  • On 2nd thought, the problem isn't in "print" because 'print U(u"味精 ")' works fine! I must have to define "__str__" in some special way. – NevilleDNZ Nov 22 '11 at 06:05
  • From what I see in CPython sources, `print` handles unicode strings but will not call `__unicode__` itself. Only `__str__` or `__repr__`. – yak Nov 22 '11 at 06:35

1 Answer

If you use sys.stdout = codecs.getwriter(encoding)(sys.stdout), then you should pass Unicode strings to print:

>>> print u"%s" % Bottle(u"魯賓遜漂流記")
debug: __unicode__ 魯賓遜漂流記
{{{魯賓遜漂流記}}}

As @bobince points out in the comments: avoid changing sys.stdout in this manner; otherwise it might break any library code that works with sys.stdout and doesn't expect to print Unicode strings.
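
To see why the wrapped stream wants Unicode strings, here is a minimal sketch (not the asker's code). It uses an io.BytesIO buffer as a stand-in for a byte-oriented sys.stdout, an assumption made purely so that the written bytes can be inspected; the names raw, out and written are illustrative only.

```python
# -*- coding: utf-8 -*-
import codecs
import io

# Stand-in for a byte-oriented stdout (assumption: lets us inspect the bytes).
raw = io.BytesIO()

# Same wrapping as in the question, applied to the stand-in stream.
out = codecs.getwriter("utf-8")(raw)

# The wrapper takes a Unicode string and encodes it before writing.
# \u5473\u7cbe is the two-character message used in the question.
out.write(u"{{{\u5473\u7cbe}}}\n")

# The underlying stream only ever receives UTF-8 encoded bytes.
written = raw.getvalue()
```

The writer encodes whatever Unicode text it is handed. On Python 2, as the traceback in the question suggests, print first turns the result of __str__ into a byte string with the default ASCII codec, so the failure happens before the wrapper is involved.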

In general:

__unicode__() should return Unicode strings:

def __init__(self, msg, encoding='utf-8'):
    if not isinstance(msg, unicode):
        msg = msg.decode(encoding)
    self.msg = msg

def __unicode__(self):
    return u"{{{%s}}}" % self.msg

__repr__() should return an ASCII-friendly str object:

def __repr__(self):
    return "Bottle(%r)" % self.msg

__str__() should return a str object. The optional encoding parameter documents which encoding is used; there is no good way to choose the encoding here:

def __str__(self, encoding="utf-8"):
    return self.__unicode__().encode(encoding)

Define a write() method:

def write(self, file, encoding=None):
    encoding = encoding or getattr(file, 'encoding', None)
    s = unicode(self)
    if encoding is not None:
        s = s.encode(encoding)
    return file.write(s)

This should cover the cases where the file has its own encoding or supports Unicode strings directly.
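
Put together, the write() approach can be tried out roughly as follows. This is only a sketch under a few assumptions of mine: text_type is a hypothetical alias (unicode on Python 2, str on Python 3) so the snippet runs on either version, write() calls __unicode__() directly instead of unicode(self) for the same reason, and __str__/__repr__ are left out to keep it short. The in-memory streams stand in for real files.

```python
# -*- coding: utf-8 -*-
import io

# Hypothetical alias: `unicode` on Python 2, `str` on Python 3.
text_type = type(u"")


class Bottle(object):
    def __init__(self, msg, encoding="utf-8"):
        # Store text internally; decode byte strings on the way in.
        if not isinstance(msg, text_type):
            msg = msg.decode(encoding)
        self.msg = msg

    def __unicode__(self):
        return u"{{{%s}}}" % self.msg

    def write(self, file, encoding=None):
        # Prefer an explicit encoding, then the file's own, else write text.
        encoding = encoding or getattr(file, "encoding", None)
        s = self.__unicode__()
        if encoding is not None:
            s = s.encode(encoding)
        return file.write(s)


# Byte stream with an explicit encoding: bytes are written.
raw = io.BytesIO()
Bottle(u"\u5473\u7cbe").write(raw, encoding="utf-8")

# Text stream that accepts Unicode strings directly: text is written.
text = io.StringIO()
Bottle(u"\u5473\u7cbe").write(text)
```

The caller, not print, decides where encoding happens, which is why this works without touching sys.stdout.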

jfs
  • @Robinson Crusoe :-) - ThanX for that! I see that `print unicode(Bottle(u"魯賓遜漂流記"))` also works. But **strangely** the obvious alternative `print >> sys.stdout, Bottle(u"魯賓遜漂流記")` does not work (even with the code `sys.stdout = codecs.getwriter("utf-8")(sys.stdout)` at the top). – NevilleDNZ Nov 22 '11 at 11:34
  • Be aware, the character encoding of your terminal is a factor here also. Regarding the `print` statement calling `__str__`, I believe this is a bug in the `print` statement. – wberry Nov 22 '11 at 15:36
  • Be very very careful hacking `sys.stdout` to be a character stream instead of a byte stream. They really aren't interchangeable concepts, so switching them out is fragile. Any library code you are using that tries to write non-ASCII bytes to `sys.stdout` would now fail. And if we are talking about outputting to the Windows Command Prompt, you should just give up now, you won't get Unicode out of it using the standard C stdio libraries that Python (and most other languages) use. – bobince Nov 25 '11 at 00:58
  • @bobince: I agree. I've added explicit warning. – jfs Nov 28 '11 at 03:21
  • @bobince: there is [`win-unicode-console` Python package that translates read/write on standard streams that are connected to Windows console into win32 API calls.](http://bugs.python.org/issue1602) – jfs Oct 08 '14 at 16:20