2

For the past few days I've been learning programming with Python, and I'm still very much a beginner. Recently I've been using the book 'Code in the Cloud' for that purpose. The thing is, while such textbooks cover a wide range of topics thoroughly, they barely touch on UTF-8 encoding in languages other than English. Hence my question: how do I make the following code display UTF-8 characters correctly in my mother tongue?

# -*- coding: utf-8 -*-
import datetime
import sys

class ChatError(Exception):
    """ Wyjątki obsługujące wszelkiego rodzaju błędy w czacie."""
    def __init__(self, msg):
        self.message = msg


# START: ChatMessage
class ChatMessage(object):
    """Pojedyncza wiadomość wysłana przez użytkownika czatu"""
    def __init__(self, user, text):
        self.sender = user
        self.msg = text
        self.time = datetime.datetime.now()

    def __str__(self):
        return "Od: %s o godzinie %s: %s" % (self.sender.username,
                                             self.time,
                                             self.msg)

# END: ChatMessage

# START: ChatUser
class ChatUser(object):
    """Użytkownik biorący udział w czacie"""
    def __init__(self, username):
        self.username = username
        self.rooms = {}

    def subscribe(self, roomname):
        if roomname in ChatRoom.rooms:
            room = ChatRoom.rooms[roomname]
            self.rooms[roomname] = room
            room.addSubscriber(self)
        else:
            raise ChatError("Nie znaleziono pokoju %s" % roomname)

    def sendMessage(self, roomname, text):
        if roomname in self.rooms:
            room = self.rooms[roomname]
            cm = ChatMessage(self, text)
            room.addMessage(cm)
        else:
            raise ChatError("Użytkownik %s nie jest zarejestrowany w pokoju %s" %
                            (self.username, roomname))

    def displayChat(self, roomname, out):
        if roomname in self.rooms:
            room = self.rooms[roomname]
            room.printMessages(out)
        else:
            raise ChatError("Użytkownik %s nie jest zarejestrowany w pokoju %s" %
                            (self.username, roomname))
# END: ChatUser

# START: ChatRoom
class ChatRoom(object):
    """A chatroom"""

    rooms = {}

    def __init__(self, name):
        self.name = name
        self.users = []
        self.messages = []
        ChatRoom.rooms[name] = self

    def addSubscriber(self, subscriber):
        self.users.append(subscriber)
        subscriber.sendMessage(self.name, 'Użytkownik %s dołączył do dyskusji.' %
                               subscriber.username)

    def removeSubscriber(self, subscriber):
        if subscriber in self.users:
            subscriber.sendMessage(self.name,
                                   "Użytkownik %s opóścił pokój." %
                                   subscriber.username)
            self.users.remove(subscriber)

    def addMessage(self, msg):
        self.messages.append(msg)

    def printMessages(self, out):
        print >>out, "Lista wiadomości: %s" % self.name
        for i in self.messages:
            print >>out, i
# END: ChatRoom

# START: ChatMain
def main():
    room = ChatRoom("Main")
    markcc = ChatUser("MarkCC")
    markcc.subscribe("Main")
    prag = ChatUser("Prag")
    prag.subscribe("Main")

    markcc.sendMessage("Main", "Hej! Jest tu kto?")
    prag.sendMessage("Main", "Tak, ja tu jestem.")
    markcc.displayChat("Main", sys.stdout)


if __name__ == "__main__":
    main()
# END: ChatMain

It was taken from the aforementioned book, but I cannot make it display non-English characters correctly in the Windows command line (even though the console supports them). As you can see, I added an encoding declaration (# -*- coding: utf-8 -*-) at the beginning, thanks to which the code runs at all. I also tried the u"string" syntax, but to no avail; it fails with the following message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u017c' in position 51: ordinal not in range(128)

What should I do to make those characters display correctly? Yes, I will often work with strings encoded in UTF-8. I would be very grateful for your help.

Mathias
  • The coding statement is for the characters used in the file, not for what it prints. You need to do something like `print username.decode('utf-8')` to tell Python to decode the string to unicode; it will then encode it correctly automatically – agf Aug 21 '11 at 12:58

4 Answers

1

This works for me currently:

#!/usr/bin/env python
# -*-coding=utf-8 -*-
gkuzmin
1

Try invoking the Python interpreter this way:

#!/usr/bin/python -S

import sys
sys.setdefaultencoding("utf-8")
import site

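This will set the global default encoding to UTF-8. The usual default encoding is ASCII. The default encoding is used whenever a Unicode string has to be converted to bytes implicitly, for example when it is written with a built-in such as print to an output stream that declares no encoding of its own.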
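The effect of an ASCII default can be reproduced without touching the interpreter's settings. The following is only a sketch, assuming Python 2.6 or later (it also runs on Python 3); the sample word is taken from the question's strings and written with an escape so that the file's own encoding does not matter.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# u"Użytkownik", spelled with an escape for the letter U+017C.
text = u"U\u017cytkownik"

try:
    # This is what an implicit conversion with the ASCII default amounts to.
    text.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # exc.object is the original string; start/end mark the offending slice.
    print("ASCII cannot represent:", repr(exc.object[exc.start:exc.end]))

# Encoding explicitly to UTF-8 succeeds; the non-ASCII letter takes two bytes.
utf8_bytes = text.encode("utf-8")
print(len(text), "characters ->", len(utf8_bytes), "bytes")
```

Encoding explicitly, as in the last two lines, avoids relying on the default at all, which is why many people prefer it to changing the default globally.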

Keith
  • I guess I am missing something obvious here, so forgive my ignorance. When I use the piece of your code I get: Traceback (most recent call last): File "D:\kody\basechat.py", line 4, in sys.setdefaultencoding("utf-8") AttributeError: 'module' object has no attribute 'setdefaultencoding' – Mathias Aug 21 '11 at 12:23
  • I guess I am missing something obvious here, so forgive my ignorance. When I use the piece of your code I get: Traceback (most recent call last): File "D:\kody\basechat.py", line 4, in sys.setdefaultencoding("utf-8") AttributeError: 'module' object has no attribute 'setdefaultencoding' I guess the issue is with the path to Python, though I can't make it work out correctly. If that's the case, when, on my Windows machine, I put the code in D:\kody\basechat.py and have Python installed in D:\Python 2.5.4, what is the right path? – Mathias Aug 21 '11 at 12:29
  • As I remember, Python removes `setdefaultencoding` from `sys` after running `site` so you have to call `reload(sys)` immediately after `import sys` if you want to use it outside `site`. – ssokolow Aug 21 '11 at 12:49
  • @ssokolow Yeah, I figured that out, but unfortunately this tip doesn't work (or I am doing something wrong). Basically it does the same as the # -*- coding: utf-8 -*- line: the code compiles and displays output, but the output is full of random characters in places where there should be non-English letters. Without either of them it throws the ASCII exception. – Mathias Aug 21 '11 at 13:11
  • @Mathias Yes, you missed the part where Python is invoked with the `-S` option (don't import site module). Then you call setdefaultencoding, then explicitly import site afterwards. The reason for this is the site module removes the setdefaultencoding method after it is used once (so it can't be changed later). – Keith Aug 21 '11 at 23:27
  • I forgot to mention that this is how it would be done on Linux/Unix (the #! line), but I'm not sure how to change the invocation on Windows. I hope you can adapt it. – Keith Aug 21 '11 at 23:30
0

Okay, I know nothing about Python, and little about the Windows command line, but after a little Googling:

I think the problem is that the Windows cmd shell doesn't support UTF-8. If I'm not wrong, this should give you a better understanding of the error:
http://wiki.python.org/moin/PrintFails

(Got that link from this question: "Unicode characters in Windows command line - how?").

It looks like you can force Python into thinking it can print UTF-8 by setting the PYTHONIOENCODING environment variable.
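As a rough illustration of what that variable does, the sketch below starts a child interpreter with PYTHONIOENCODING set and asks it which encoding its standard output uses. The helper name child_stdout_encoding is made up for this example. Note the assumptions: the variable is only honoured by Python 2.6 and later (so not by the 2.5.4 mentioned in the question), and subprocess.check_output needs Python 2.7 or later.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and return
    the value it reports for sys.stdout.encoding (hypothetical helper)."""
    env = dict(os.environ)              # copy, so the parent's environment is untouched
    env["PYTHONIOENCODING"] = io_encoding
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return out.decode("ascii").strip()


if __name__ == "__main__":
    print(child_stdout_encoding("utf-8"))
```

This only changes the bytes Python emits. Whether the console then renders them correctly still depends on the console's own code page and font.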

This question is about finding utf8 enabled windows shells:
Is there a Windows command shell that will display Unicode characters?

It may be helpful. I hope you solve your problem.

Robin Winslow
  • I wish that was it. It would make everything easy. The thing is, I can type UTF-8 characters in the console with no problem. It's just displaying them the right way from Python that doesn't work. – Mathias Aug 21 '11 at 12:36
  • @Mathias: I notice that `putty` handles UTF-8 just fine. It isn't Python's job to display them right. That is the job of your terminal program. – tchrist Aug 21 '11 at 19:19
0

The Windows terminal sometimes uses a non-UTF-8 encoding (see the question "python: unicode in Windows terminal, encoding used?"). You might therefore want to try the following:

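import sys

# Encoding of the console; it is None when output is piped, so fall back to UTF-8.
stdout_encoding = sys.stdout.encoding or 'utf-8'


def printMessages(self, out):
    print >>out, ("Lista wiadomości: %s" % self.name).decode('utf-8').encode(stdout_encoding)
    for i in self.messages:
        # i is a ChatMessage, so convert it to a byte string before decoding.
        print >>out, str(i).decode('utf-8').encode(stdout_encoding)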

This takes your byte strings, turns them into character strings (your file indicates that they are encoded in UTF-8), and then encodes them for your terminal.
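The same decode-then-encode step can be isolated in a small helper. This is only a sketch under stated assumptions: the function name recode is made up for this example, cp852 is used as a stand-in for a console code page that covers Polish letters (your console may report a different one), and the 'replace' error handler is an addition not present in the code above, so that characters the target encoding lacks become '?' instead of raising an exception. It runs on Python 2.6 or later and on Python 3.

```python
# -*- coding: utf-8 -*-


def recode(data, target_encoding, source_encoding="utf-8"):
    """Decode a byte string and re-encode it for another encoding,
    substituting '?' for characters the target cannot represent."""
    text = data.decode(source_encoding)            # bytes -> Unicode text
    return text.encode(target_encoding, "replace")  # text -> bytes for the console


# u"Użytkownik" as UTF-8 bytes, which is how the source file stores it.
utf8_bytes = u"U\u017cytkownik".encode("utf-8")

# cp852 can represent the Polish letter, so nothing is lost.
console_bytes = recode(utf8_bytes, "cp852")

# ASCII cannot, so the letter is replaced rather than raising an error.
ascii_bytes = recode(utf8_bytes, "ascii")
```

In the chat example, the target encoding would come from sys.stdout.encoding rather than being hard-coded.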

You can find useful information about the general issue of encoding and decoding on StackOverflow.

Eric O. Lebigot