How to properly handle non-ASCII strings in Python

Question

I'm building an application whose database holds data with Latin symbols. Users are able to enter this data. What I've been doing so far is calling encode('latin2') on every user input and decode('latin2') at the very end, when displaying data in the template. This is a bit annoying, and I'm wondering if there is a better way of handling this.

generally you want to store data as unicode and transform to/from unicode at input/output points. user_input --> decode --> unicode --> encode --> user_output. — monkut, Oct 03 '12 at 03:03
What do you mean "on the back end has data with latin symbols"? Is this data in a database? — Mu Mind, Oct 03 '12 at 03:13
@monkut Should I then make sure that the data in the database is stored as unicode? — marcin_koss, Oct 03 '12 at 03:39
"The only problem left is that data also contains html markup like [...] and [...] tags which get decoded and display as part of the text." What web framework are you using? — Mu Mind, Oct 03 '12 at 04:29
Updated my answer to explain a bit about escaping markup (great choice on Flask, btw!) — Mu Mind, Oct 03 '12 at 04:48
yes, on user input convert to unicode, and store in db as unicode. — monkut, Oct 03 '12 at 05:30

score 2 · Answer 1 · edited May 23 '17 at 12:04

Python's unicode type is designed to be the "natural" representation for strings. Apart from the unicode type, byte strings (str) are expected to be in some unspecified encoding, but there's no way to "tag" them with the encoding used, and Python will very insistently assume that they are in ASCII or UTF-8. As such, you're probably asking for headaches if you write your whole program to assume that str means latin2. Encoding problems have a way of creeping in at odd places in the code and percolating through layers, sometimes getting bad data into your database, and ultimately causing odd behavior or nasty errors somewhere completely unrelated and very hard to debug.
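
To see why an untagged byte string is risky, here is a minimal sketch. It is written for Python 3, where bytes corresponds to Python 2's str and str corresponds to Python 2's unicode, and the sample word is only an illustration. The same bytes come out as different text depending on which encoding the reader assumes, and nothing raises an error:

```python
# The same byte string decodes to different text depending on the codec.
# Nothing in the bytes themselves records which encoding produced them.
text = u"za\u017c\u00f3\u0142\u0107"   # sample Polish text (illustrative)

raw = text.encode("latin2")            # bytes as a latin2 database might hold them

as_latin2 = raw.decode("latin2")       # correct guess: round-trips cleanly
as_latin1 = raw.decode("latin1")       # wrong guess: no error, just wrong characters

print(as_latin2 == text)
print(as_latin1 == text)
```

The wrong guess fails silently, which is how bad data ends up stored and only surfaces much later.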

I would recommend you see about converting your db data to UTF-8.

If you can't do that, I would strongly recommend moving your encoding/decoding calls right up to the moment you transmit data to/from the database. If you have any sort of database abstraction layer, you can probably configure it to handle that for you more or less automatically. Then you should make sure any user input is converted to the unicode type right away.
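
A rough sketch of that layout, assuming Python 3 naming (bytes at the edges, str as the unicode type in between). The helper names from_db, to_db and from_user, the DB_ENCODING constant, and the UTF-8 default for user input are all hypothetical, not part of Flask or any database library:

```python
# Hypothetical helpers: decode at the input edge, encode at the output edge.
DB_ENCODING = "latin2"   # assumption: the legacy encoding of the database


def from_db(raw_bytes):
    """Turn bytes read from the database into text (unicode)."""
    return raw_bytes.decode(DB_ENCODING)


def to_db(text):
    """Turn text into bytes just before writing to the database."""
    return text.encode(DB_ENCODING)


def from_user(raw_bytes, encoding="utf-8"):
    """Decode raw user input right away; the utf-8 default is an assumption."""
    return raw_bytes.decode(encoding)


# Everything between the edges works on text only.
user_text = from_user(u"\u0142\u00f3d\u017a".encode("utf-8"))
stored = to_db(user_text)          # bytes, only at the moment of writing
shown = from_db(stored).upper()    # text again, safe to manipulate
print(shown)
```

With the conversions confined to three small functions, switching the database to UTF-8 later means changing one constant rather than hunting for encode/decode calls throughout the code.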

Using unicode types and explicitly encoding/decoding this way also has the advantage that if you do have encoding problems, you will probably notice sooner and you can just throw unicode-nazi at them to track them down (see How can you make python 2.x warn when coercing strings to unicode?).

For your markup problem: Flask and Jinja2 will by default escape any unsafe characters in your strings before rendering them into your HTML. To override the autoescaping, just use the safe filter:

<h1>More than just text!</h1>
<div>{{ html_data|safe }}</div>

See Flask Templates: Controlling Autoescaping for details, and use this with extreme caution since you're effectively loading code from the database and executing it. In real life, you'll probably want to scrub the data (see Python HTML sanitizer / scrubber / filter or Jinja2 escape all HTML but img, b, etc).
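
As a rough illustration of what autoescaping protects against, the sketch below uses the standard library's html.escape (Python 3) rather than Jinja2 itself. It is not Jinja2's implementation, only the same idea, and the sample string is made up:

```python
import html

# Made-up value as it might come back from the database.
html_data = '<b>bold</b><script>alert("x")</script>'

# Escaped form: markup characters become entities, so the browser shows
# the tags as text instead of interpreting them.
escaped = html.escape(html_data)
print(escaped)

# Marking the value as safe would emit it unchanged, script tag included,
# which is why untrusted data should be scrubbed first.
print(html_data)
```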

I converted database tables to utf8_unicode_ci. Data is now being returned in utf8 and I use decode('utf8') in the template. The only problem left is that data also contains html markup like [...] and [...] tags which get decoded and display as part of the text. — marcin_koss, Oct 03 '12 at 04:25
I would add the obligatory link to [the joelonsoftware article _"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"_](http://www.joelonsoftware.com/articles/Unicode.html) — Burhan Khalid, Oct 03 '12 at 04:54
@MuMind The "safe" filter worked perfectly. For some reason I didn't think the solution would be in Jinja. As for Flask, I just love the simplicity of the framework. Thank you so much for your help! — marcin_koss, Oct 03 '12 at 04:54

score 1 · Answer 2 · answered Oct 03 '12 at 03:07

Try adding this to the top of your program:

 import sys
 reload(sys)
 sys.setdefaultencoding('latin2')

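We have to reload sys because setdefaultencoding() is removed from the sys module at startup (site.py deletes it), and reloading the module brings it back: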

>>> import sys
>>> sys.setdefaultencoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> reload(sys) 
<module 'sys' (built-in)>
>>> sys.setdefaultencoding
<built-in function setdefaultencoding>

Marcus

Found this presentation where the author recommends against using setdefaultencoding(): http://farmdev.com/talks/unicode/ – marcin_koss Oct 03 '12 at 03:24
That presentation warns that setting the default encoding can break some third-party modules; depending on what you're doing, it's probably not a significant problem. – Perkins Oct 03 '12 at 04:39
