I'm building an application whose database contains data with Latin symbols, and users are able to enter this data. So far I've been calling encode('latin2') on every user input and decode('latin2') at the very end, when displaying the data in the template. This is a bit annoying, and I'm wondering if there is a better way of handling this.
2 Answers
Python's unicode type is designed to be the "natural" representation for strings. Besides the unicode type, strings are expected to be in some unspecified encoding but there's no way to "tag" them with the encoding used, and python will very insistently assume that strings are in ASCII or UTF-8 encoding. As such, you're probably asking for headaches if you write your whole program to assume that str means latin2. Encoding problems have a way of creeping in at odd places in the code and percolating through layers, sometimes getting bad data in your database, and ultimately causing odd behavior or nasty errors somewhere completely unrelated and impossible to debug.
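To see how a wrong encoding assumption shows up in practice, here is a small sketch. It is written for Python 3, where `bytes` plays the role of Python 2's `str` and `str` the role of `unicode`; the sample character and the variable names are only for illustration.

```python
text = "\u0142"  # the letter "l with stroke", common in Polish text

# Encoded as latin2 this character is the single byte 0xB3.
raw = text.encode("latin2")

# Decoding with the wrong single-byte codec raises no error,
# it silently yields a different character.
wrong = raw.decode("latin1")

# Decoding as UTF-8 fails outright, because 0xB3 is not a valid start byte.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Only the codec that was used for encoding restores the original text.
right = raw.decode("latin2")
```

The silent case is the dangerous one: nothing fails at the point of the mistake, and the bad character only surfaces later, possibly after it has been written back to the database.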
I would recommend you see about converting your db data to UTF-8.
If you can't do that, I would strongly recommend moving your encoding/decoding calls right up to the moment you transmit data to/from the database. If you have any sort of database abstraction layer, you can probably configure it to handle that for you more or less automatically. Then you should make sure any user input is converted to the unicode type right away.
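A minimal sketch of that "decode at the edges" idea, again in Python 3 terms (`bytes` for encoded data, `str` for text). The helper names `from_db`, `to_db` and `from_user` are hypothetical, and the assumption that user input arrives as UTF-8 bytes is mine; a web framework or database driver may already do some of this for you.

```python
DB_ENCODING = "latin2"  # assumption: the encoding the database actually stores


def from_db(raw):
    """Decode bytes read from the database into text."""
    return raw.decode(DB_ENCODING)


def to_db(text):
    """Encode text just before it is written to the database."""
    return text.encode(DB_ENCODING)


def from_user(raw, encoding="utf-8"):
    """Decode raw request bytes as soon as they arrive (UTF-8 is assumed)."""
    return raw.decode(encoding)


# Everything between these calls works with text only.
name = from_user(b"\xc5\x81\xc3\xb3d\xc5\xba")  # UTF-8 bytes for a Polish city name
stored = to_db(name)                            # latin2 bytes, one per character
restored = from_db(stored)
```

With the conversions confined to these few functions, the rest of the program never needs to know which encoding the database uses.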
Using unicode types and explicitly encoding/decoding this way also has the advantage that if you do have encoding problems, you will probably notice sooner and you can just throw unicode-nazi at them to track them down (see How can you make python 2.x warn when coercing strings to unicode?).
For your markup problem: Flask and Jinja2 will by default escape any unsafe characters in your strings before rendering them into your HTML. To override the autoescaping, just use the safe filter:
<h1>More than just text!</h1>
<div>{{ html_data|safe }}</div>
See Flask Templates: Controlling Autoescaping for details, and use this with extreme caution since you're effectively loading code from the database and executing it. In real life, you'll probably want to scrub the data (see Python HTML sanitizer / scrubber / filter or Jinja2 escape all HTML but img, b, etc).
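To illustrate what "scrubbing" could mean, here is a rough allowlist sketch using only the Python 3 standard library. The names `ALLOWED_TAGS`, `AllowlistScrubber` and `scrub` are hypothetical, and the tag list is an arbitrary choice. It only demonstrates the idea (keep a few known tags, drop all attributes, escape the text); for real use, a maintained sanitizing library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}  # hypothetical allowlist


class AllowlistScrubber(HTMLParser):
    """Rebuilds markup, keeping only allowlisted tags and no attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append("<%s>" % tag)  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(escape(data))  # text content is always escaped


def scrub(html_text):
    """Return html_text with every tag outside the allowlist removed."""
    parser = AllowlistScrubber()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

The output of something like `scrub(html_data)` is what you would then pass through the safe filter, instead of the raw database value.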
-
I converted database tables to utf8_unicode_ci. Data is now being returned in utf8 and I use decode('utf8') in the template. The only problem left is that the data also contains HTML markup tags, which get decoded and displayed as part of the text. – marcin_koss Oct 03 '12 at 04:25
-
I would add the obligatory link to [the joelonsoftware article _"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"_](http://www.joelonsoftware.com/articles/Unicode.html) – Burhan Khalid Oct 03 '12 at 04:54
-
@MuMind The "safe" filter worked perfectly. For some reason I didn't think the solution would be in Jinja. As for Flask, I just love the simplicity of the framework. Thank you so much for your help! – marcin_koss Oct 03 '12 at 04:54
Try adding this to the top of your program:
import sys
reload(sys)
sys.setdefaultencoding('latin2')
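We have to reload sys because setdefaultencoding is removed from the sys module at startup, and reloading the module brings it back: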
>>> import sys
>>> sys.setdefaultencoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding
<built-in function setdefaultencoding>

-
Found this presentation, where the author recommends against using setdefaultencoding(): http://farmdev.com/talks/unicode/ – marcin_koss Oct 03 '12 at 03:24
-
That presentation warns that setting the default encoding can break some third-party modules. Depending on what you're doing, it's probably not a significant problem. – Perkins Oct 03 '12 at 04:39
"... tags which get decoded and display as part of the text." What web framework are you using? – Mu Mind Oct 03 '12 at 04:29