OK, I just want to understand this. After more than an hour debugging an endpoint, after testing the API a dozen different times with Postman and making sure it works locally, I kept getting a weird Unicode error in production. I found that if I remove a print statement, it works.

This is the relevant code of my entry point:

import json

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

from .models import Book


@csrf_exempt
def create_books(request):

    sent_json = request.body

    if not sent_json:
        return HttpResponse("No json in request.body", status=404)

    sent_json = json.loads(sent_json)
    books = sent_json['books']

    print "books: %s" % books
    for num, book in books.iteritems():
        title = book['title']
        writer = book['writer']
        if Book.objects.filter(title=title, writer=writer).exists():
            book = Book.objects.get(title=title, writer=writer)
        else:
            book = Book.objects.create(title=title, writer=writer)

        print "book.title: %s" % book.title  # !!! ERROR

So ... when I print the dict like this:

 print "books: %s" % books

Everything is fine, but when I print the book.title

 print "book.title: %s" % book.title  # !!! ERROR

I get a UnicodeEncodeError. The title that causes the error is of course contained in the books dictionary. But why does it raise an error after it has been saved to the database and accessed as an attribute of the object?

After I removed the second print, everything worked. But I don't understand why.

Alejandro Veintimilla

3 Answers


TLDR: To solve the problem you need to encode the string before passing it to print:

print "book.title: %s" % book.title.encode('utf-8')

Answer:

The representation of books produced by repr(books) (which is what print "books: %s" % books uses automatically) does not contain any "special" (non-ascii) characters, because repr() escapes them. But book.title does contain them, and repr() is not used in that case.

If you pass a unicode object to print, it will try to encode it using the encoding found in sys.stdout.encoding (if detected), or ascii if not. The best approach is to always encode your data before sending it across the boundary of your software.
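The failure mode is easy to reproduce in isolation. A minimal sketch (runs on Python 2 or 3; on Python 2, print performs the ascii encode implicitly whenever sys.stdout.encoding is not detected, e.g. when output is piped):

```python
# -*- coding: utf-8 -*-
# Encoding a non-ascii unicode string with the ascii codec fails,
# which is what Python 2's print does implicitly when
# sys.stdout.encoding is None.
title = u"\u0f00\u0f01\u0f02"  # Tibetan characters, not representable in ascii

try:
    title.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# Encoding to utf-8 first always succeeds and yields bytes
# that are safe to hand to print.
encoded = title.encode('utf-8')
```

This is why adding .encode('utf-8') before the print makes the error go away: print then receives bytes instead of a unicode object and performs no implicit encoding of its own.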

Álvaro Justen
  • it may produce a mojibake if the environment doesn't use utf-8. [print Unicode directly instead](http://stackoverflow.com/a/35100464/4279) – jfs Jan 30 '16 at 11:32
  • In my opinion the best approach is to always encode before sending data out. If `sys.stdout.encoding` is `None` (when you use bash pipes it will be `None` no matter your terminal configuration) and you use unicode, then Python will try to use the `ascii` codec and your software will break if it has non-ascii characters. – Álvaro Justen Jan 30 '16 at 16:35
  • It won't break. `PYTHONIOENCODING` envvar is configured when the output is redirected. [Mojibake is not some theoretical concern even presidents may encounter it](http://goo.gl/QlkFXZ). To understand the I/O encoding issues better, follow the links that I've provided. – jfs Jan 30 '16 at 17:03
  • If the environment is not configured properly it will break. I prefer to show the message (and not break) in the wrong encoding than breaking the software. – Álvaro Justen Jan 30 '16 at 17:11
  • Two gold rules for working with encoding: 1- decode as soon as the data enters your software, 2- encode as late as possible. – Álvaro Justen Jan 30 '16 at 17:12
  • It is called Unicode sandwich. And printing Unicode directly follows *"as late as possible"* rule (if Unicode API is used such as in `win-unicode-console` case then there is no conversion to bytes at all—it can't be later than that). It is all good but it is unrelated to your mojibake issue. Software should not corrupt data silently. *In some cases*, `locale.getpreferredencoding()` may be used if `sys.stdout.encoding` is `None` and there is no better default (to configure `PYTHONIOENCODING` or equivalent) in a specific case (like on Python 3). – jfs Jan 30 '16 at 18:20

The problem is that the unicode string in book.title can't be encoded in your terminal's encoding. You can check sys.stdout.encoding to see what that encoding is.

Suppose I have a unicode title (which may or may not display properly in your browser)...

>>> title = u"ༀ༁༂༃༄༅༆༇༈༉༊"
>>> book = { 'title':title }

If I print book, I get a string representation of a dict, which doesn't try to encode the unicode string

>>> print "%s" % book
{'title': u'\u0f00\u0f01\u0f02\u0f03\u0f04\u0f05\u0f06\u0f07\u0f08\u0f09\u0f0a'}

But if I print the string directly, the string is encoded to your local terminal

>>> print "%s" % title
ༀ༁༂༃༄༅༆༇༈༉༊

It worked for me, but your string failed for you. You can solve the problem by doing the encoding yourself and choosing a policy for unencodable characters

>>> print "%s" % title.encode(sys.stdout.encoding, 'replace')
ༀ༁༂༃༄༅༆༇༈༉༊

It all still works for me because I have a utf-8 terminal, but you should see question marks in there.
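The 'replace' policy can be checked directly, without involving a terminal at all. A minimal sketch (the b'' prefixes are for Python 3; on Python 2 the results are plain str):

```python
# 'replace' substitutes '?' for characters the target codec cannot
# represent, so encoding never raises, at the cost of losing data.
title = u"caf\u00e9"  # 'café'

utf8 = title.encode('utf-8')                    # lossless
ascii_lossy = title.encode('ascii', 'replace')  # 'é' becomes '?'
```

With sys.stdout.encoding as the target codec, this is exactly the trade-off made above: the print always succeeds, but unencodable characters are silently replaced by question marks.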

tdelaney
  • [Don't sprinkle your code with .encode() calls; print Unicode directly instead](http://stackoverflow.com/a/35100464/4279) – jfs Jan 30 '16 at 10:12

Containers such as dict and list call repr() on their items during printing (i.e., during the str() call) and therefore you don't see any Unicode errors: repr() escapes unprintable (non-ascii on Python 2) characters:

>>> print u"\N{EURO SIGN}"
€
>>> print [u"\N{EURO SIGN}"] # container (list) calls repr(u"€")
[u'\u20ac']
>>> print repr(u"\N{EURO SIGN}")
u'\u20ac'

Don't sprinkle your code with .encode() calls; print Unicode directly instead. If that leads to Unicode errors, fix the environment: e.g., configure your locale (the default C locale implies ascii, which you don't want); see the LANG, LC_CTYPE, LC_ALL envvars and/or the PYTHONIOENCODING envvar (and/or install win-unicode-console on Windows).
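The escaping that repr() performs here can be reproduced on any Python version with the unicode_escape codec (a sketch; Python 2's repr() additionally wraps the result in u'...' quoting):

```python
# Each non-ascii character becomes its \uXXXX escape sequence --
# the same transformation Python 2's repr() applies to unicode
# items inside containers, which is why printing the dict "works".
euro = u"\N{EURO SIGN}"
escaped = euro.encode('unicode_escape')
```

So printing the books dict never touches sys.stdout.encoding: every title has already been reduced to pure-ascii escapes before print sees it.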

jfs