
I am working through the Django RSS reader project here.

The RSS feed will read something like "OKLAHOMA CITY (AP) — James Harden let". The RSS feed's encoding reads encoding="UTF-8" so I believe I am passing utf-8 to markdown in the code snippet below. The em dash is where it chokes.

I get the Django error "'ascii' codec can't encode character u'\u2014' in position 109: ordinal not in range(128)", which is a UnicodeEncodeError. In the variables being passed I see "OKLAHOMA CITY (AP) \u2014 James Harden". The line of code that fails is:

content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")

I am using markdown 2.0, django 1.1, and python 2.4.

What is the magic sequence of encoding and decoding that I need to do to make this work?


(In response to Prometheus' request. I agree the formatting helps)

So in views I add a smart_unicode line above the parsed_feed encoding line...

content = smart_unicode(content, encoding='utf-8', strings_only=False, errors='strict')
content = content.encode(parsed_feed.encoding, "xmlcharrefreplace")

This pushes the problem to my models.py for me where I have

def save(self, force_insert=False, force_update=False): 
     if self.excerpt: 
         self.excerpt_html = markdown(self.excerpt) 
         # super save after this 

If I change the save method to have...

def save(self, force_insert=False, force_update=False): 
     if self.excerpt: 
         encoded_excerpt_html = (self.excerpt).encode('utf-8') 
         self.excerpt_html = markdown(encoded_excerpt_html)

I get the error "'ascii' codec can't decode byte 0xe2 in position 141: ordinal not in range(128)" because the excerpt now contains the bytes "\xe2\x80\x94" where the em dash was.

user140314

3 Answers


If the data that you are receiving is, in fact, encoded in UTF-8, then it should be a sequence of bytes -- a Python 'str' object, in Python 2.x.

You can verify this with an assertion:

assert isinstance(content, str)

Once you know that that's true, you can move to the actual encoding. Python doesn't do transcoding -- directly from UTF-8 to ASCII, for instance. You need to first turn your sequence of bytes into a Unicode string, by decoding it:

unicode_content = content.decode('utf-8')

(If you can trust parsed_feed.encoding, then use that instead of the literal 'utf-8'. Either way, be prepared for errors.)

You can then take that string, and encode it in ASCII, substituting high characters with their XML entity equivalents:

xml_content = unicode_content.encode('ascii', 'xmlcharrefreplace')

The full method, then, would look something like this:

try:
    content = content.decode(parsed_feed.encoding).encode('ascii', 'xmlcharrefreplace')
except UnicodeDecodeError:
    # Couldn't decode the incoming string -- possibly not encoded in utf-8
    # Do something here to report the error; re-raising is only a placeholder
    raise
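As a rough, self-contained sketch of the decode-then-encode steps above, the snippet below runs them on the sample headline from the question. The helper name `to_ascii_xml` and the fallback to the 'replace' error handler are assumptions made for illustration, not part of the original code. The `b'...'` literal needs Python 2.6 or later; on Python 2.4 a plain string literal is already a byte string, so the prefix can be dropped.

```python
def to_ascii_xml(content, encoding='utf-8'):
    """Decode a byte string, then re-encode it as ASCII with XML character references."""
    try:
        text = content.decode(encoding)
    except UnicodeDecodeError:
        # Assumption: rather than failing, replace undecodable bytes with U+FFFD.
        text = content.decode(encoding, 'replace')
    # Characters outside ASCII become numeric references such as &#8212;
    return text.encode('ascii', 'xmlcharrefreplace')


# UTF-8 bytes as they might arrive from the feed; \xe2\x80\x94 is the em dash.
raw = b'OKLAHOMA CITY (AP) \xe2\x80\x94 James Harden let'
print(to_ascii_xml(raw))
```

The result is a pure-ASCII byte string in which the em dash (U+2014) appears as `&#8212;`, so later implicit ASCII conversions should no longer fail on it.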
Ian Clelland

Django provides a couple of useful functions for converting back and forth between Unicode and bytestrings:

from django.utils.encoding import smart_unicode, smart_str

nikola
  • Using... content = smart_unicode(content, encoding='utf-8', strings_only=False, errors='strict') content = content = content.encode(parsed_feed.encoding, "xmlcharrefreplace") pushes the problem to my models.py for me where I have def save(self, force_insert=False, force_update=False): if self.excerpt: self.excerpt_html = markdown(self.excerpt) # super save after this If I change the save method to have encoded_excerpt_html = (self.excerpt).encode('utf-8') self.excerpt_html = markdown(encoded_excerpt_html) – user140314 Mar 25 '10 at 05:00
  • Part 2: I get the error "'ascii' codec can't decode byte 0xe2 in position 141: ordinal not in range(128)" because now it reads "\xe2\x80\x94" where the em dash was. – user140314 Mar 25 '10 at 05:01
  • Could you please amend your original post with the above? It's very difficult to read without proper formatting. – nikola Mar 25 '10 at 08:00

I encountered this error when passing file names to ZipFile.write while building a zip file. The following failed

ZipFile.write(root+'/%s'%file, newRoot + '/%s'%file)

and the following worked

ZipFile.write(str(root+'/%s'%file), str(newRoot + '/%s'%file))
highvelcty
  • Calling `str()` on a unicode value with non-ASCII characters would result in the exact same error the OP is seeing. – Martijn Pieters Sep 25 '12 at 15:00
  • @MartijnPieters: Hi, that is a very important point that you make. I can find no reference to what `str()` is actually doing in [the fine manual](http://docs.python.org/2/library/functions.html#str) however I attribute that to me being a Python noob more than a fault of the manual. Where is this documented, what exactly is `str()` doing to the argument, and what exactly does `str()` return? Thanks! – dotancohen Jun 12 '13 at 07:59
  • `str()` returns a *byte string*; characters with values between 0 and 255, with 0-127 usually interpreted and displayed as ASCII characters. A `unicode()` value, on the other hand, can represent any codepoint in the Unicode standard, between 0 and 1114111. So using `str(unicodevalue)` to turn unicode into a byte string is going to involve *some* transformation. – Martijn Pieters Jun 12 '13 at 12:29
  • The `unicode` type is implemented in C, but it provides the C API equivalent of the `__str__` hook to make that transformation; the [implementation](http://hg.python.org/cpython/file/ca8e86711403/Objects/unicodeobject.c#l7553) calls [`PyUnicode_AsEncodedString()`](http://hg.python.org/cpython/file/ca8e86711403/Objects/unicodeobject.c#l1291), and that function uses `PyUnicode_GetDefaultEncoding()`; guess what that function does. :-) – Martijn Pieters Jun 12 '13 at 12:31
  • Since you cannot pass in an encoding to `str()`, Python does not have a choice but to use the default encoding. So it is always **much better** to explicitly encode to a byte string, when you need the latter. Don't use `str(unicodevalue)`; it rarely is a good idea. – Martijn Pieters Jun 12 '13 at 12:32
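
To illustrate the point about explicit encoding, here is a small sketch that contrasts an explicit UTF-8 encode with an ASCII encode. It assumes the default encoding is ASCII, as it normally is in Python 2, so `text.encode('ascii')` stands in for what `str(text)` would attempt there. The variable names are illustrative only.

```python
text = u'OKLAHOMA CITY (AP) \u2014 James Harden'

# Explicit encoding: the em dash becomes the three UTF-8 bytes E2 80 94.
utf8_bytes = text.encode('utf-8')

# Encoding with the ASCII codec cannot represent U+2014 and raises.
# Assumption: this mirrors str(text) on Python 2 with the default ASCII encoding.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(repr(utf8_bytes))
print(ascii_failed)
```

Choosing the codec explicitly keeps the conversion visible at the call site instead of depending on the interpreter's default.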