Hack Jinja2 to encode from `utf-8` instead of `ascii`?

Question

Jinja2 converts all template variables into unicode before processing. Can anybody find the place where this happens?

The problem is that it assumes that strings are ascii, but we (at Roundup) are using utf-8 internally and our ORM (HyperDB) restores object properties to utf-8 automatically, and converting them all into unicode in every view just before passing to templates is too much legwork.

Avoid the implicit Python 2 default bytes->unicode coercion; pass unicode explicitly. Do not `setdefaultencoding('utf-8')` -- it hides bugs. — jfs, Feb 22 '15 at 10:37
@J.F.Sebastian, Python 2 doesn't have bytes. It assumes these are `ascii` strings. Can you be explicit in what bug becomes hidden by switching from `ascii` to `utf-8`? — anatoly techtonik, Feb 22 '15 at 10:50
What legwork? Why not just add a wrapper to the ORM to decode your strings? Why is this a problem with Jinja2? — Martijn Pieters, Apr 10 '15 at 14:05
@MartijnPieters, good question: why not decode in the ORM, before the string even enters the core? I guess that's because Roundup uses values from the DB to implement tracker logic, so it is not clear how they'd compare to unicode objects returned from the DB. — anatoly techtonik, Apr 12 '15 at 06:52
@MartijnPieters, the problem with Jinja2 is that it assumes all strings passed into it are 7-bit ASCII if they are not Unicode, and there is no switch to tell it they are not ASCII but an 8-bit encoding such as WINDOWS-1251, for example. — anatoly techtonik, Apr 12 '15 at 06:53
@techtonik: this is not a Jinja2 problem. Jinja2 assumes you work with text and as such would like you to use *unicode only*. Any byte strings are converted to unicode with the default codec (so ASCII). Feed it `unicode` objects consistently and there is no problem. — Martijn Pieters, Apr 12 '15 at 09:23
@MartijnPieters Jinja2 2.8 is released https://github.com/mitsuhiko/jinja2/issues/511 - do you think this is now fixed? Maybe it is possible to overload concatenation methods in unicode objects to take control over implicit conversions from Python? — anatoly techtonik, Nov 10 '15 at 08:24
@MartijnPieters yes, but I still can not find a proof that it is possible to override unicode.__add__ method to resolve this. — anatoly techtonik, Nov 10 '15 at 19:00
Armin gave you proof. Jinja uses `unicode.format()` and `u'...' % ()` in places, which won't call `unicode.__add__`. Since those use string literals you can't provide a subclass either. — Martijn Pieters, Nov 10 '15 at 19:27
@techtonik: Python 2 does have `bytes`. As I said: don't rely on the implicit conversion in Python 2 (it is forbidden in Python 3). *"what bugs becomes hidden"*: for example, look up the word ["mojibake"](http://www.hanselman.com/blog/WhyTheAskObamaTweetWasGarbledOnScreenKnowYourUTF8UnicodeASCIIAndANSIDecodingMrPresident.aspx) -- [utf-8 is not the only character encoding](http://stackoverflow.com/a/33726891/4279) — jfs, Nov 19 '15 at 04:59
@J.F.Sebastian your theory is great and the advice is sound, but it is useless for the real world, which is this issue http://issues.roundup-tracker.org/issue2550811 - converting every field of every object from 'utf-8' to unicode before passing it to Jinja2 would be a major performance hit. — anatoly techtonik, Nov 19 '15 at 08:32
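
The wrapper idea raised in the comments above could look roughly like the sketch below. It is only an illustration: `TextProxy` and `Issue` are hypothetical names, not part of Roundup, HyperDB, or Jinja2, and the sketch assumes property values are UTF-8 byte strings exposed as plain attributes. Decoding happens only for attributes that are actually read, which may limit the cost compared with converting every field up front.

```python
class TextProxy(object):
    """Hypothetical wrapper: decodes byte-string attributes on access."""

    def __init__(self, obj, encoding="utf-8"):
        self._obj = obj
        self._encoding = encoding

    def __getattr__(self, name):
        # Only called for names not found on the proxy itself,
        # so every lookup is forwarded to the wrapped object.
        value = getattr(self._obj, name)
        if isinstance(value, bytes):  # `bytes` is `str` on Python 2
            return value.decode(self._encoding)
        return value


class Issue(object):
    """Stand-in for an ORM object holding UTF-8 encoded bytes."""
    id = 42
    title = u"\u041f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")


proxy = TextProxy(Issue())
decoded_title = proxy.title  # text (unicode), decoded lazily
issue_id = proxy.id          # non-bytes values pass through unchanged
```

A view would then pass `TextProxy(issue)` to the template instead of `issue`. Values nested inside lists or other objects are not handled by this sketch.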

anatoly techtonik · Accepted Answer · 2015-11-11T07:05:51.717

2

Answer from Armin:

Unfortunately that is impossible. Jinja uses the default string coercion on 2.x that Python provides for speed. There are no guaranteed calls to make something unicode. The only shitty choice you have is to reload sys and call sys.setdefaultencoding('utf-8') or something.

UPDATE: Jinja2 2.8 contains some changes related to implicit string conversions. This gives me the idea that it may be possible to do without sys.setdefaultencoding('utf-8') by overriding the `__add__` method of the unicode type and making sure that this type is used first when concatenating strings.

https://github.com/mitsuhiko/jinja2/issues/511
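
For reference, the workaround Armin mentions is usually written along the lines below on Python 2. This is a sketch, not a recommendation: `force_utf8_default` is a hypothetical helper name, and the `reload(sys)` call is there because `site.py` deletes `sys.setdefaultencoding` at startup. The setting is process-wide, so byte strings in any other encoding can be decoded incorrectly or fail far from where they entered the program. On Python 3 the function does nothing, because the default encoding is already UTF-8 and implicit coercion no longer exists.

```python
import sys


def force_utf8_default():
    """Make implicit str -> unicode coercion use UTF-8 on Python 2."""
    if sys.version_info[0] == 2:
        # site.py removes sys.setdefaultencoding at startup;
        # reloading the module brings it back.
        reload(sys)  # noqa: F821 (builtin on Python 2 only)
        sys.setdefaultencoding("utf-8")
    return sys.getdefaultencoding()


current_default = force_utf8_default()
```

The call has to run before any template is rendered, for example in the application's entry point.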

edited Nov 11 '15 at 07:05

answered Feb 22 '15 at 10:32

anatoly techtonik


Great! I went through all the discussions about the dangers of setdefaultencoding(); your firm stand paid off well and brought great clarity. The religious fanatics couldn't bring an anecdote for a 'real' danger, which exists only in their dreams. Just started and placed this in the project-level __init__.py itself :-D – nehem Nov 24 '15 at 09:30
@itsneo yea, the matter seems overly complicated, so nobody sees the whole picture. But if you come up with a real disaster story, don't forget to add me to CC. =) – anatoly techtonik Nov 24 '15 at 12:31
