I'm administering some Python code in which I now see an error in the logs:

Traceback (most recent call last):
  File "./app/core.py", line 772, in scrapeEmail
    l.info('EMAIL SUBJECT: ', header['value'])
  File "./app/__init__.py", line 44, in info
    logging.info(str(datetime.utcnow()) + ' INFO     ' + caller.filename + ':' + str(caller.lineno) + ' - ' + ' '.join([str(x) for x in args]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 25: ordinal not in range(128)

which I guess means that header['value'] contains differently encoded characters.

I searched around, and this SO answer suggests that I "put .encode('utf-8') at the end of the object for recent versions of Python".

This raised two questions for me:

  1. On which object do I need to call .encode('utf-8'): on x or on str(x)? In other words, should it be str(x.encode('utf-8')) or str(x).encode('utf-8')?
  2. What does the writer mean by "recent versions of Python"? Can I still use .encode('utf-8') in Python 2.7?

Normally I would simply try it, but it is not easy (actually impossible) to find the string on which the error occurred. So I can't really test it.

A little help would be greatly appreciated here.

kramer65
  • For 1) unless your object x implements method `encode`, you use it on the string (which has a method `.encode`) – DainDwarf Dec 08 '15 at 13:55
  • That answer is not relevant to you; randomly putting encode on the end of string calls is unlikely to help. The problem is more likely that you have overridden the `info` method with your own implementation, which does not do the right thing. The decision about what to put in a log message belongs to the [formatter](https://docs.python.org/2/library/logging.html#logging.Formatter), not a logger subclass. – Daniel Roseman Dec 08 '15 at 14:01
  • Have you tried using `unicode('something')` instead of `str('something')`? – pazitos10 Dec 08 '15 at 14:06

1 Answer

I suggest you first get a clear understanding of the relationship between unicode and other encodings (e.g. GB2312, GBK). Once you have that, encoding and decoding are no longer a major problem :)

The following diagram shows the relationship. Once you get the main point of it, you will know when and how to encode and decode in your code. :)

---------              -----------             ----------
|       |  1.decode(A) |         | 2.encode(B) |        |
|   A   | -----------> | unicode | ----------->|   B    |
|       | <----------- |         | <---------- |        |
|       |  4.encode(A) |         | 3.decode(B) |        |
---------              -----------             ----------
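
The four arrows in the diagram can be sketched in code. This is a minimal illustration, assuming A is GBK and B is UTF-8 (my choice for the example; any pair of encodings works the same way), and the variable names are made up:

```python
# Bytes for two Chinese characters, stored in encoding A (assumed here: GBK).
a_bytes = b"\xd6\xd0\xce\xc4"

# 1. decode(A): bytes in A -> unicode text
text = a_bytes.decode("gbk")

# 2. encode(B): unicode text -> bytes in B (assumed here: UTF-8)
b_bytes = text.encode("utf-8")

# 3. decode(B): bytes in B -> unicode text again
text_again = b_bytes.decode("utf-8")

# 4. encode(A): unicode text -> bytes in A again
a_again = text_again.encode("gbk")

# The round trip is lossless because both encodings can represent the text.
print(a_again == a_bytes)
```

Note that there is no direct arrow between A and B: every conversion passes through unicode in the middle.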

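So, according to the diagram, you need to know which encoding your data is in now and which encoding you want to convert it to, and then follow the arrows: decode from the current encoding to unicode first, then encode from unicode to the target encoding.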
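
Applied to the logging line in the question, a rough sketch could look like the one below. On Python 2.7, calling str() on a unicode value that contains non-ASCII characters performs an implicit ASCII encode, which is the step that raises in the traceback. The helpers `to_text` and `join_args` are hypothetical names of mine, not part of the asker's code or of the logging module, and I am assuming any byte strings are UTF-8. The idea is to convert every argument to unicode first, join, and encode only once at the end if bytes are needed:

```python
# Works on Python 2.7 and Python 3: `unicode` on 2.x, `str` on 3.x.
text_type = type(u"")


def to_text(value, encoding="utf-8"):
    """Return `value` as unicode text (hypothetical helper)."""
    if isinstance(value, text_type):
        return value  # already unicode, nothing to do
    if isinstance(value, bytes):
        # Byte string: decode it (assumption: it is UTF-8); undecodable
        # bytes are replaced instead of raising.
        return value.decode(encoding, "replace")
    return text_type(value)  # numbers, None, other objects


def join_args(*args):
    """Join arbitrary arguments into one unicode message."""
    return u" ".join(to_text(a) for a in args)


message = join_args(u"EMAIL SUBJECT:", u"f\xeate", b"caf\xc3\xa9", 42)
encoded = message.encode("utf-8")  # encode once, only where bytes are required
```

This also covers the two questions: .encode('utf-8') is available on unicode objects in Python 2.7, and it belongs on the unicode value itself, not on str(x), because str(x) is the call that fails.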

Ryan Chou