
When using unicode strings in source code, there seem to be many ways to skin a cat. The docs and the relevant PEPs have plenty of information about what's possible, but they are scant on what is preferred.

For example, the following each seem to give the same result:

# coding: utf8
u1 = '\xe2\x82\xac'.decode('utf8')  # decode the UTF-8 byte escapes for U+20AC
u2 = u'\u20ac'                      # unicode escape in a unicode literal
u3 = unichr(0x20ac)                 # build the character from its code point
u4 = "€".decode('utf8')             # decode a literal byte string
u5 = u"€"                           # unicode literal containing the symbol itself
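
All five compare equal, which a quick check confirms (a sketch, assuming the file really is saved as UTF-8):

assert u1 == u2 == u3 == u4 == u5 == u'\u20ac'                    # all are U+20AC
assert all(isinstance(u, unicode) for u in (u1, u2, u3, u4, u5))  # all unicode objects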

With a __future__ import, I've found one more option:

# coding: utf8
from __future__ import unicode_literals
u6 = "€"
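
For what it's worth, under unicode_literals the bare literal is already a unicode object, as a quick check shows:

print type(u6)          # <type 'unicode'>
print u6 == u'\u20ac'   # True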

In Python I am used to there being one obvious way to do it, so what is the recommended method of including international content in source files?

This is a Python 2 question.


Some background...

Methods u1, u2 and u3 just seem silly to me, but I have seen enough people write like this that I assume it is not just personal preference. Is there any particular reason to force only ASCII characters in source files rather than specifying the encoding, or is this just a habit more likely to be found in older code lying around?

There's a huge readability improvement in using the actual symbols rather than escape sequences; not doing so seems to ignore a strength of the language rather than take advantage of the hard work the Python devs put in.

wim
  • Related: [Any gotchas using unicode_literals in Python 2.6?](http://stackoverflow.com/q/809796) – Martijn Pieters Apr 14 '14 at 14:54
  • The rest is primarily opinion-based, I fear. `u1`, `u3` and `u4` are in my opinion not something I'd ever consider. It's `u2` or `u5` for me, with `u2` less susceptible to other people misconfiguring their text editors, but it should only be used with a comment after it explaining which codepoint it represents, and only if such non-ASCII codepoints are relatively rare. – Martijn Pieters Apr 14 '14 at 15:04

1 Answer


I think the most common way I've used (in Python 2) is:

# coding: utf-8

text = u'résumé'
  • The text is readable. Compare to text = u'r\u00e9sum\u00e9', where I must look up what character that is. Everything else is less readable.
  • If you're using Unicode, your variable is most certainly text and not binary data, so there's no point in keeping it in anything other than a unicode object. (Just in case the byte-string version, '€', was being considered; see the sketch just below.)
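
To make the text/binary distinction concrete, here is a minimal sketch (Python 2 semantics; the variable names are just for illustration) of what the byte-string version actually holds:

# coding: utf-8
s = '€'            # str: the three UTF-8 bytes for the euro sign
u = u'€'           # unicode: a single code point
print len(s)       # 3
print len(u)       # 1
print repr(s[0])   # '\xe2' -- indexing a str yields bytes, not characters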

from __future__ import unicode_literals changes the parsing mode of the file it appears in; I think you'd need to be more aware of the difference between text and binary data. (Something that, if you ask me, most programmers are not good at.)
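
A minimal sketch of the kind of surprise this can cause (an assumed example; the names are hypothetical): once every bare literal is unicode, implicitly mixing it with byte strings triggers Python 2's ASCII coercion:

# coding: utf-8
from __future__ import unicode_literals

name = 'résumé'     # a unicode object now, not a str
data = b'\xc3\xa9'  # an explicit byte string (the UTF-8 encoding of 'é')

ok = name + data.decode('utf8')  # fine: unicode + unicode
# name + data                    # raises UnicodeDecodeError: implicit ASCII decode of the bytes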

In large projects, it can be confusing to have the parsing mode change for just one file, so it's probably better as an all-files-or-no-files decision; that way you never need to check the file header. If you're on Python 2, the default is probably off unless you're also targeting Python 3. If you're targeting Python 2.5 or older¹, it's not an option.

Most editors these days are Unicode-aware. That said, I have seen editors corrupt non-ASCII characters in files, though exceedingly rarely; if the author of such a commit doesn't review their code adequately, code review should catch it (the diff will be painfully obvious). It is not worth catering to such setups: Unicode is here to stay; track the offenders down and fix their configuration. Of note, vim handles Unicode just fine.

¹You should upgrade.

Thanatos