
Unicode and string encoding still give me headaches. I followed this question / answer to be able to add special characters (äÄÜ..) to a message.

For the following structure, I have trouble understanding why version 2 works and version 1 does not.

My model:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from django.db import models


class Project(models.Model):
    """
    Representation of a project
    """

    name = models.CharField(max_length=200)

    def __unicode__(self):
        return '%s ' % (self.name)

Version 1:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

def print_project(self, project):
    project_prefix = "Project: "
    print (project_prefix + str(project))

Version 2:

# -*- coding: utf-8 -*-

def print_project(self, project):
    project_prefix = "Project: "
    print (project_prefix + str(project))

As you can see, the only difference is that version 1 has the from __future__ import unicode_literals import. The error thrown by version 1 is the following:

'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Thomas Kremmel

1 Answer


After that __future__ statement, your literals are not str objects, but unicode objects. That's the whole point of the statement. That isn't described too well, either in the __future__ docs or in PEP 3112 which they refer to (which spends most of its time talking about how to write Python 2-style bytes objects, given that string literals are now Unicode). But that's what it does.

You can test this in the interactive interpreter:

>>> 'abc'
'abc'
>>> from __future__ import unicode_literals
>>> 'abc'
u'abc'

So, in version 2, you're adding two str objects together, which is easy. But in version 1, you're adding a unicode and a str. Python 2 handles that by implicitly decoding the str to unicode with the default encoding, which is ASCII. That fails as soon as the str contains a non-ASCII byte, such as the 0xc3 that starts the UTF-8 encoding of ä, Ä or Ü.
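
The implicit conversion can be made explicit to see where the error comes from. This is only a sketch: implicit_concat is a hypothetical helper that mimics what Python 2 does for unicode + str, and the project name is an assumed example value.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function

# Assumed example: the UTF-8 bytes that str(project) would return on
# Python 2 for a project named "\u00c4rger" (capital A-umlaut + "rger").
name_bytes = '\u00c4rger'.encode('utf-8')


def implicit_concat(prefix, raw):
    """Hypothetical helper: mimic unicode + str by ASCII-decoding the bytes."""
    return prefix + raw.decode('ascii')


try:
    implicit_concat('Project: ', name_bytes)
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message printed is the same one quoted in the question, because the first byte of the encoded name is 0xc3.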


The easiest way to fix this is to convert project to a unicode object directly, instead of going through str:

def print_project(self, project):
    project_prefix = "Project: "
    print (project_prefix + unicode(project))

This will, in fact, work with or without the __future__ statement—with it, project_prefix is already unicode; without it, it's a str and will be decoded from ASCII, but that's fine, because it is ASCII.
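
A small stand-in class shows the difference between the two conversions without needing Django. Everything here is an assumption for illustration: FakeProject is hypothetical, its __str__ only approximates what Django does on Python 2 (UTF-8 encoding the text form), and text_type is a shim so the sketch also runs on Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function
import sys

PY2 = sys.version_info[0] == 2
# unicode only exists on Python 2; on Python 3 str is already the text type.
text_type = unicode if PY2 else str  # noqa: F821


class FakeProject(object):
    """Hypothetical stand-in for the Django model, not the real Project."""

    def __init__(self, name):
        self.name = name

    def __unicode__(self):
        return '%s' % self.name

    if PY2:
        def __str__(self):
            # Assumed approximation of Django on Python 2: UTF-8 bytes.
            return self.__unicode__().encode('utf-8')
    else:
        __str__ = __unicode__


def format_project(project):
    # Text + text: no implicit ASCII decoding is involved.
    return 'Project: ' + text_type(project)


line = format_project(FakeProject('\u00c4rger'))
print(repr(line))
```

Because both operands are already text, the concatenation never touches the ASCII codec.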

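If you want to use non-ASCII literals (in the project_prefix), and you want your code to work with and without the __future__ statement, write the literal as a byte string and decode it manually. The b prefix (Python 2.6+) matters here: with the __future__ statement a plain "Project: " is already unicode, and calling .decode() on a unicode object first encodes it as ASCII, which fails for non-ASCII text.

def print_project(self, project):
    project_prefix = b"Project: ".decode('utf-8')
    print (project_prefix + unicode(project))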

(Make sure to match the source file's coding declaration, of course.)
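
If the same code has to accept values that may or may not already be text, one option is to decode only when the value is a byte string. This is a sketch under assumptions: to_text is a hypothetical helper, and UTF-8 is assumed to be the encoding of the bytes.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):  # bytes is an alias for str on Python 2
        return value.decode(encoding)
    return value


# A byte-string prefix (UTF-8 bytes for a-umlaut) and the same prefix as text.
prefix = to_text(b'Pr\xc3\xa4fix: ')
already_text = to_text('Pr\u00e4fix: ')

print(prefix == already_text)
```

Checking the type first avoids calling .decode() on a value that is already unicode.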


In a comment, you ask:

when using the __future__ import statement do I still have to define the coding at the beginning of the .py file? # -*- coding: utf-8 -*-

The short answer is yes.

I don't know if the documentation directly covers this anywhere, but if you think about it, there's no other way it could work.

In order to interpret literals in your 8-bit source code as Unicode, the Python compiler has to decode them. The only way it knows what to decode them from is your coding declaration.

Another way to look at this is that the __future__ statement makes Python 2 work like Python 3 as far as string literals are concerned, and Python 3 also relies on the coding declaration whenever the source file is not in its default encoding.

If you want to test this for yourself, copy the following as UTF-8 and paste it into a text file. (Note that you have to use an editor that doesn't understand coding declarations to do this. Something like Emacs may convert your UTF-8 text to Latin-1 on saving!)

# -*- coding: latin-1 -*-
from __future__ import unicode_literals
print repr('é')

When you run this, it will print out u'\xc3\xa9', not u'\xe9'.
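
The same mismatch can be reproduced without saving a file, by decoding the UTF-8 bytes of é by hand. This is a sketch of what the compiler effectively does with the literal, not its actual implementation; the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function

# The two bytes an editor writes for e-acute when saving as UTF-8.
utf8_bytes = '\u00e9'.encode('utf-8')

# Decoded with the (wrong) declared encoding: two separate characters.
as_latin1 = utf8_bytes.decode('latin-1')

# Decoded with the encoding the file was really saved in: one character.
as_utf8 = utf8_bytes.decode('utf-8')

print(len(as_latin1), len(as_utf8))
```

The Latin-1 result has two characters (U+00C3 and U+00A9), which is the u'\xc3\xa9' from the test file; the UTF-8 result is the single character U+00E9.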

While Python 3 defaults to UTF-8 if you don't specify a coding, Python 2.5-2.7 defaults to ASCII, even with unicode_literals. So, you still need the coding declaration. (It's always safe to add, even in 3.x, and it also makes many programmers' text editors happy, so it may be a habit worth keeping until we get far enough into the future that nobody remembers Latin-1 and Shift-JIS and cp1250 and so on.)

abarnert
  • Thanks for this perfect answer! I already checked it and it works. I will go with the unicode(project) approach. Just one more question that pops up in my mind.. when using the __future__ import statement do I still have to define the coding at the beginning of the .py file? # -*- coding: utf-8 -*- – Thomas Kremmel Jun 11 '13 at 20:51
  • Have you read the [2.x](http://docs.python.org/2/howto/unicode.html) and [3.x](http://docs.python.org/3/howto/unicode.html) HOWTO files on Unicode? (If you're using `unicode_literals` in 2.x, you kind of need both, but at least you can skim over the parts that are repeated.) – abarnert Jun 11 '13 at 21:33
  • Thanks. Just read it now. Worth a read! Now I think I'm ready to deal with all the characters out there :) – Thomas Kremmel Jun 12 '13 at 07:14
  • @Tom: Glad you found that one. I'll check it out, and hopefully I'll have something new to recommend to other people with similar questions in the future. – abarnert Jun 12 '13 at 19:13