My code includes

from __future__ import unicode_literals

and has many functions that accept (and expect) Unicode strings as input in order to function fully.

Is there a way to ensure that users (in scripts, Python, or IPython, etc.) also use Unicode literals so that, for example

my_func("AβC")

does not cause an error ("'ascii' codec can't decode byte 0xce ...") and so that

my_func(u"AβC")

is not necessary?

hippietrail
orome

1 Answer

No, unicode_literals is a per-module configuration and must be imported in each and every module where it is to be used.

The best way is simply to state in the docstring of my_func that it expects unicode objects.

If you insist, you could enforce it to fail early:

def my_func(my_arg):
    # Fail early: reject byte strings (Python 2 `str`) before doing any work.
    if not isinstance(my_arg, unicode):
        raise TypeError('This function only handles unicode inputs')

If you need it in many places, it might be nicer to implement it with a decorator.
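
As a rough sketch of what such a decorator could look like (the name `require_unicode` is made up for illustration, the `text_type` shim is only there so the snippet also runs on Python 3, and raising `TypeError` is one reasonable choice of exception):

```python
import functools

try:
    text_type = unicode  # Python 2: the dedicated text type
except NameError:
    text_type = str      # Python 3: str is already text


def require_unicode(func):
    """Hypothetical decorator: reject any argument that is not a text string."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for value in list(args) + list(kwargs.values()):
            if not isinstance(value, text_type):
                raise TypeError(
                    '%s only handles unicode inputs, got %s'
                    % (func.__name__, type(value).__name__))
        return func(*args, **kwargs)
    return wrapper


@require_unicode
def my_func(my_arg):
    # Placeholder body; the real function would do its work here.
    return len(my_arg)
```

This version checks every positional and keyword argument, which only suits functions whose arguments are all text. A function that mixes text with other types would need a narrower check.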

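On Python 3.5 or later, you could also annotate the parameter with type hints. These are not checked at runtime, but a static type checker such as mypy can flag callers that pass bytes.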
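
A minimal illustration of the type-hint variant, assuming Python 3.5+ where `str` is the text type. The function body is a placeholder. Annotations alone are not checked when the function runs, so this only helps if callers run a static checker such as mypy:

```python
def my_func(my_arg: str) -> int:
    # The annotation documents that text is expected; Python itself does not
    # verify it at call time.
    return len(my_arg)


# A static checker would report this call as passing bytes where str is
# expected, yet it still executes without any error at runtime.
unchecked = my_func(b"abc")
```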

wim
  • Instead of raising an exception, can I just do `my_arg = my_arg.decode('utf-8')`? This *seems* to work, and solves all the problems (at least in my test cases) where I might have non-Unicode arguments supplied but my code assumes Unicode. – orome Nov 18 '15 at 20:33
  • I would advise against that, because you should never be calling `.decode` on a unicode object. Python 2 is sloppy here, but it will raise an exception in Python 3. – wim Nov 18 '15 at 20:49
  • Yes, so I'd do it inside `if not isinstance(my_arg, unicode):`, correct? – orome Nov 18 '15 at 20:54
  • It's feasible, but you have to modify the argspec to give the caller a way to provide the encoding: `def my_func(my_arg, encoding='utf-8')`, then `if isinstance(my_arg, str): my_arg = my_arg.decode(encoding)`. Remember, unicode objects are already decoded. If the caller can ever send you a bytes object, it is also **their responsibility to tell you the encoding**. In my opinion, it's a confusing interface to accept both bytestrings and unicode strings; it is better to forbid bytestrings from coming in at all, earlier in the app. – wim Nov 18 '15 at 20:58
  • Ah right (I'm still confused about Unicode in general): the incoming bytestring could have *any* encoding, I can't assume UTF-8 — correct? – orome Nov 18 '15 at 21:07
  • That's correct. And it is impossible to tell from the bytestring itself. If someone sends you bytes, it's *already encoded* and they have to tell you the encoding so that you can decode it correctly. – wim Nov 18 '15 at 22:01
  • Yes, so an error here is clearly the correct (or at least the most honest) approach. One final question: Is this Pythonic? I mean, is it polite/ok/reasonable to fail like this? It seems not only a bit harsh, but also weird that it's necessary. – orome Nov 18 '15 at 22:39
  • Here's [what I ended up doing](https://github.com/orome/crypto-enigma-py/compare/develop@%7B1day%7D...develop#diff-216f7b68af59a0ecdffef5810c8bb3ef). *But* — it messes up my command line script, as can be seen at the end of [this log](https://travis-ci.org/orome/crypto-enigma-py/builds/91929576). Any ideas on how to fix that? – orome Nov 18 '15 at 23:01
  • It looks like decoding with `sys.getfilesystemencoding()` might do the trick, but I'm not 100% sure that's the correct encoding. – orome Nov 18 '15 at 23:29
  • As I mentioned, best practice is to find where any bytestrings can come into your app and decode them immediately there. Then just write your code assuming everything is unicode already. Ned explains it better than I could, in his presentation here -> [unipain](http://nedbatchelder.com/text/unipain.html) – wim Nov 18 '15 at 23:30
  • Yes, I think that's what I'm doing now. My only remaining question is how to determine the encoding used at the command line. Using `sys.getfilesystemencoding()` seems to work, but I'd like to confirm that it always does. (And that seems to be the only place I can determine encoding, right?) – orome Nov 18 '15 at 23:35
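
The "decode at the boundary" advice from the comments above can be sketched roughly as follows. The helper name `to_text` is hypothetical, and the encoding is deliberately a required argument because it cannot be recovered from the bytes themselves. Which encoding actually applies to command-line input is left to the caller here:

```python
def to_text(value, encoding):
    """Hypothetical boundary helper: decode byte strings, pass text through."""
    if isinstance(value, bytes):  # on Python 2, `bytes` is an alias for `str`
        return value.decode(encoding)
    return value


raw = b"A\xce\xb2C"  # the UTF-8 encoding of "AβC"

as_utf8 = to_text(raw, "utf-8")      # decoded with the encoding it was written in
as_latin1 = to_text(raw, "latin-1")  # also succeeds, but yields different text
```

Both calls succeed, which is why guessing is risky: decoding with the wrong encoding often produces wrong text without raising any error.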