Trace functions that are called on Python strings

Question

Is it possible in Python to trace and filter functions that are called on strings while the program runs? I want to add sys.setdefaultencoding("utf-8") to the application, and I want to set up some guards to catch potential problems caused by misusing standard functions (like len, for example) to process such strings.

I guess that, despite the advice you've recently received about the [Dangers of sys.setdefaultencoding('utf-8')](http://stackoverflow.com/questions/28657010/dangers-of-sys-setdefaultencodingutf-8) (also, [Hack Jinja2 to encode from `utf-8` instead of `ascii`?](http://stackoverflow.com/q/28642781/4014959)), you are still determined to use it... — PM 2Ring, Apr 12 '15 at 06:57
Isn't the fact that you already need to start implementing hacks like these indication enough for you that it's a bad idea to mess with `sys.setdefaultencoding()`? Use unicode internally, [pass unicode to Jinja templates](http://jinja.pocoo.org/docs/dev/api/#unicode), and only convert to/from `utf-8` at your system boundaries. — Lukas Graf, Apr 12 '15 at 07:22
@PM2Ring I need to fix http://issues.roundup-tracker.org/issue2550811 ASAP, because it blocks the next Roundup release and its development, and switching to using unicode internally looks like an epic task that will take a lot of time and break the existing template engine (TAL) and extensions. — anatoly techtonik, Apr 12 '15 at 12:38
@techtonik then *fix it*, and don't pretend to fix it by applying hacks on top of bandaids. How about this: Include [`unicode-nazi`](https://pypi.python.org/pypi/unicode-nazi) in your tests, and run your test suite (assuming you've got decent test coverage). Every place where `unicode-nazi` complains about implicit conversions, you'll introduce a subtle, hard to find bug when you change the default encoding. — Lukas Graf, Apr 12 '15 at 19:11
@LukasGraf, `unicode-nazi` is a good suggestion. The problem is that implicit conversions are made by Jinja2, and another templating layer (TAL) already works with `utf-8` byte strings. I need to investigate further whether TAL works with Unicode. There is also an email backend that may break, so it is not that simple to just go Unicode. There are also user extensions that we cannot check. — anatoly techtonik, Apr 13 '15 at 14:22
@techtonik but you could surely just decode your byte strings to unicode before you pass them to Jinja2 templates? No need to switch *everything* to use unicode internally at once, though I'd certainly consider it for the long run. — Lukas Graf, Apr 13 '15 at 17:12
@LukasGraf, many strings are passed to Jinja2 as object properties, and I don't know a way to do such a copy-and-patch operation on any possible object safely. The only way to deal with that is to get objects from the ORM already converted. But that misses data from internet forms and other sources (email). — anatoly techtonik, Apr 16 '15 at 08:21

score 2 · Answer 1 · answered Apr 12 '15 at 06:51

You can replace the builtin:

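import __builtin__  # Python 2 module name; it is called `builtins` on Python 3

# Keep a reference to the original so the wrapper can delegate to it.
real_len = __builtin__.len

def checked_len(s):
    # ... do extra checks ...
    return real_len(s)

# From here on, every call to len() in the process goes through checked_len().
__builtin__.len = checked_len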
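
As one possible shape for the extra checks, here is a minimal sketch that warns when `len` is called on a byte string containing non-ASCII bytes, with an opt-out for code that really wants a byte count. The names `_guard_enabled`, `_has_non_ascii` and `byte_len_allowed` are made up for this illustration, and the choice of check (non-ASCII bytes) is an assumption about what counts as suspicious, not something the answer specifies.

```python
import contextlib
import warnings

try:
    import __builtin__ as builtins  # Python 2, as in the snippet above
except ImportError:
    import builtins  # Python 3 name of the same module

real_len = builtins.len
_guard_enabled = [True]  # hypothetical switch; a list so it can be mutated in place


def _has_non_ascii(data):
    # Iterating over a bytearray yields integers on both Python 2 and 3.
    return any(byte > 127 for byte in bytearray(data))


def checked_len(s):
    # Only byte strings are flagged: their length is a byte count, which
    # differs from the character count for non-ASCII UTF-8 text.
    if _guard_enabled[0] and isinstance(s, bytes) and _has_non_ascii(s):
        warnings.warn("len() called on a non-ASCII byte string", stacklevel=2)
    return real_len(s)


@contextlib.contextmanager
def byte_len_allowed():
    # Hypothetical opt-out for call sites that want the byte count,
    # for example when computing an HTTP Content-Length header.
    previous = _guard_enabled[0]
    _guard_enabled[0] = False
    try:
        yield
    finally:
        _guard_enabled[0] = previous


builtins.len = checked_len

payload = u"caf\u00e9".encode("utf-8")
with byte_len_allowed():
    content_length = len(payload)  # byte count is intended here, so no warning
```

Because `stacklevel=2` is passed, each warning points at the line that called `len`, which makes it possible to review call sites one by one and wrap the legitimate ones in `byte_len_allowed()`.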

R Samuel Klatchko

74,869
16
134
187

This works for `len`, but it does not help much with other functions. How can I skip the check for strings where it's legal to use `len`, because I need the length in bytes (such as when calculating HTTP headers)? There are variables and operations (not string content) for which this check should not fire. It is OK to set those filters manually. – anatoly techtonik Apr 12 '15 at 07:00
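
For functions other than `len`, one option that avoids wrapping each of them is the interpreter's profiling hook. The sketch below is only an illustration, not something proposed in the thread: it uses `sys.setprofile`, whose `c_call` event passes the C function object about to be called, and records calls whose bound receiver (`__self__`) is a string. The names `calls_on_strings`, `string_call_profiler` and `trace_string_calls` are hypothetical. It only sees methods such as `s.upper()` or `s.encode()`; free functions like `len(s)` do not expose their arguments to the hook, and operators such as `+` or `%` are not reported as function calls at all.

```python
import sys

calls_on_strings = []  # hypothetical log of (method name, file name, line number)

# bytes is str on Python 2; type(u"") is unicode on Python 2 and str on Python 3.
STRING_TYPES = (bytes, type(u""))


def string_call_profiler(frame, event, arg):
    # For "c_call" events, `arg` is the C function object being called.
    # Bound built-in methods carry the object they act on in __self__.
    if event == "c_call":
        receiver = getattr(arg, "__self__", None)
        if isinstance(receiver, STRING_TYPES):
            calls_on_strings.append(
                (arg.__name__, frame.f_code.co_filename, frame.f_lineno)
            )


def trace_string_calls(func, *args, **kwargs):
    # Install the hook only for the duration of one call, then remove it.
    sys.setprofile(string_call_profiler)
    try:
        return func(*args, **kwargs)
    finally:
        sys.setprofile(None)
```

The recorded file names and line numbers can then be filtered by hand, which matches the idea of setting the filters manually. The hook slows the program down noticeably, so it is better suited to a test run than to production.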
