How do I use characters that are not supported by default in Python?

Question

Possible Duplicate:
UnicodeDecodeError, invalid continuation byte

I am trying to use characters such as á and ô in a Python-generated PDF. The program uses the dateutil module (and several others) to generate a PDF calendar (with images). The calendar is laid out with LaTeX.

My aim is to create calendars in French, but the array of French month names includes characters that Python does not appear to recognise.

During generation, the program prints the following error on the command line: SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe9 in position 0: unexpected end of data

How can I tell Python to use these characters?

If it helps, the array is:

FRENCH_MONTHS = [u'NotAMonth', u'Janvier', u'Février', u'Mars', u'Avril', u'Mai', u'Juin', u'Juillet', u'Aôut', u'Septembre', u'Octobre', u'Novembre', u'Décembre']

Also used:

MPS to PDF converter

Where does the problem lie? In writing the characters in the Python script? Or in adding them to the PDF (i.e., a problem with the PDF library)? Also, which PDF library are you using, and can you edit your question to include the code you've written so far? — Blair, Dec 04 '12 at 07:24
@user1833746 so I have heard. It may be some other problem in how the program is doing things, or in what I have done along the way. But I did not write the program and do not have sufficient depth of knowledge in Python to find out how the program does what it does. — damned truths, Dec 04 '12 at 07:49

Martijn Pieters · Answer 1 · 2012-12-04T08:08:36.053

3

You have to specify the source encoding used to create your python source. Do this by adding a source encoding declaration:

# coding: UTF-8

This should be the first or second line of your Python source file. The encoding has to match the encoding you saved the file in; check your text editor settings. The error message you added to your question indicates that the encoding doesn't match. I suspect you saved the file as Latin-1 (ISO 8859-1) instead.
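As a rough way to check which case applies, the sketch below tries to decode raw bytes as UTF-8 and falls back to Latin-1. The helper name `guess_source_encoding` is made up for illustration, and treating Latin-1 as the only other candidate is an assumption based on the 0xe9 byte in the error message. It is written for Python 3.

```python
def guess_source_encoding(raw):
    """Return 'utf-8' if the bytes decode cleanly as UTF-8, else 'latin-1'.

    Hypothetical helper: Latin-1 accepts every byte value, so the fallback
    never fails, but it is only an assumption about how the file was saved.
    """
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


# The same month name as an editor would write it in each encoding.
saved_as_latin1 = "F\u00e9vrier".encode("latin-1")  # b'F\xe9vrier'
saved_as_utf8 = "F\u00e9vrier".encode("utf-8")      # b'F\xc3\xa9vrier'

print(guess_source_encoding(saved_as_latin1))  # latin-1
print(guess_source_encoding(saved_as_utf8))    # utf-8
```

To check a real file, pass in the bytes read with `open(path, "rb").read()`. If the result is latin-1, either re-save the file as UTF-8 in your editor or change the declaration to match the encoding actually used.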

Alternatively, use Unicode escapes to include non-ASCII characters; u'\u00e9' represents é (e with an acute accent) as a Unicode code point.
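A small check of what that escape stands for, assuming Python 3 (where the `u` prefix is optional). The variable name `escaped` is only for illustration.

```python
import unicodedata

# Written with ASCII characters only, so the file's encoding cannot affect it.
escaped = "\u00e9"

print(len(escaped))               # 1: the escape is a single character
print(unicodedata.name(escaped))  # LATIN SMALL LETTER E WITH ACUTE
print("F\u00e9vrier" == "F" + chr(0xE9) + "vrier")  # True
```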

Please do study up on how Python handles Unicode in the Python Unicode HOWTO. The Joel Spolsky Unicode article is also essential reading for any software developer dealing with any non-ASCII data.

edited Dec 04 '12 at 08:08

answered Dec 04 '12 at 07:20

Martijn Pieters


If the `coding` line were the problem, it would report a different exception. – glglgl Dec 04 '12 at 07:48
@glglgl Indeed, this answer was written before the question included the all-important error message! – dbr Dec 04 '12 at 07:59
@glglgl: The question was edited. – Martijn Pieters Dec 04 '12 at 08:11
I tested that line in a file and got a `SyntaxError: Non-ASCII character '\xc3' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details` which is a different text, thus my comment. But maybe it is just because of different interpreter versions? – glglgl Dec 04 '12 at 08:21
@glglgl: The OP had not reported any *proper* error *including a traceback* yet. :-P But yes, it's probably interpreter dependent on what the error reported is. – Martijn Pieters Dec 04 '12 at 08:24

score 2 · Answer 2 · edited May 23 '17 at 12:12

You can either use the # coding: UTF-8 declaration, provided your editor is set up to save the file in that encoding.

Alternatively, you can write the characters as ASCII-only escape sequences. For example, the escape sequence for é is \u00E9:

FRENCH_MONTHS = [u'NotAMonth', u'Janvier', u'F\u00E9vrier', ...]

This is less likely to be messed up by a badly configured editor, but achieves the exact same thing.
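If the list is long, the escapes can be generated rather than typed by hand. The following is a minimal Python 3 sketch; `escape_non_ascii` is a hypothetical helper, not part of any library, and it deliberately leaves quotes and backslashes in the input untouched.

```python
def escape_non_ascii(text):
    """Replace each non-ASCII character with a \\uXXXX or \\UXXXXXXXX escape."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)                 # plain ASCII stays as it is
        elif code <= 0xFFFF:
            parts.append("\\u%04X" % code)   # four hex digits
        else:
            parts.append("\\U%08X" % code)   # eight hex digits beyond U+FFFF
    return "".join(parts)


months = ["Janvier", "F\u00e9vrier", "D\u00e9cembre"]
for name in months:
    print(escape_non_ascii(name))
```

The printed text can be pasted between quotes in the source file, which then contains only ASCII bytes.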

Even better, you could use the calendar module and sidestep the entire issue (based on this answer):

import calendar


def get_month_names(locale):
    with calendar.TimeEncoding(locale) as encoding:
        months = list(calendar.month_name)

        # Could do this to match original values:
        # months = [x.title() for x in months]

        if encoding is not None:
            months = [x.decode(encoding) for x in months]

        months[0] = u"NotAMonth"
        return months

FRENCH_MONTHS = get_month_names("fr_FR.UTF-8")
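The function above is Python 2 code: it relies on `str.decode`, which Python 3 strings do not have. A possible Python 3 variant, sketched here with the documented `locale.setlocale` call instead, is shown below. The name `get_month_names_py3` is hypothetical, and the sketch assumes the requested locale is installed on the system; when it is not, it returns None rather than guessing.

```python
import calendar
import locale


def get_month_names_py3(locale_name):
    """Return 13 month names for locale_name, or None if it is unavailable."""
    saved = locale.setlocale(locale.LC_TIME)  # query the current setting
    try:
        locale.setlocale(locale.LC_TIME, locale_name)
        # calendar.month_name is looked up lazily, so copy it while the
        # requested locale is still active.
        months = list(calendar.month_name)
    except locale.Error:
        return None
    finally:
        # Always restore the previous LC_TIME setting.
        locale.setlocale(locale.LC_TIME, saved)
    months[0] = "NotAMonth"
    return months


FRENCH_MONTHS = get_month_names_py3("fr_FR.UTF-8")
print(FRENCH_MONTHS)
```

Note that `locale.setlocale` changes process-wide state, so this is best called once at start-up rather than from several threads.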

Edit: This is the same problem as this question - your é is encoded as Latin-1, but your Python source-file encoding is UTF-8 (either explicitly set in Python 2, or because it's the default in Python 3):

>>> print "\xe9".decode("latin1")
é
>>> print "\xe9".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data

Another good reason for using one of the alternative solutions above!
