1

I'm looking for a simple way of converting a user-supplied string to UTF-8. It doesn't have to be very smart; it should handle all ASCII byte strings and all Unicode strings (2.x unicode, 3.x str).

Since unicode is gone in 3.x and str changed meaning, I thought it might be a good idea to check for the presence of a decode method and call it without arguments, letting Python figure out what to do based on the locale, instead of doing isinstance checks. Turns out that's not a good idea at all:

>>> u"één"
u'\xe9\xe9n'
>>> u"één".decode()
Traceback (most recent call last):
  File "<ipython-input-36-85c1b388bd1b>", line 1, in <module>
    u"één".decode()
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

My question is two-fold:

  1. Why is there a unicode.decode method at all? I thought Unicode strings were considered "not encoded". This looks like a sure way of getting doubly encoded strings.
  2. How do I tackle this problem in a way that is forward-compatible with Python 3?
Fred Foo

3 Answers

5

It's not useful to speak of "decoding" a unicode string. You want to encode it to bytes. unicode.decode is solely there for historical reasons; its semantics are meaningless. Therefore, it has been removed in Python 3.
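As an illustration of why the call in the question fails with an *encode* error: in Python 2, `unicode.decode` first converts the string to bytes with the default codec (ASCII) and only then decodes. The sketch below reproduces that first step explicitly; the variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
text = u"\xe9\xe9n"  # the string u"één" from the question

# Python 2's unicode.decode() implicitly does this encode step first,
# which is why the traceback shows a UnicodeEncodeError.
try:
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# On Python 3 this is False, because str no longer has a decode method;
# on Python 2 it is True.
text_has_decode = hasattr(text, "decode")
```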

However, the encode/decode semantics have historically been extended to include (character) string-to-string or bytes-to-bytes encodings such as rot13 or bzip2. These pseudo-encodings were absent from Python 3.0 and 3.1 and were reintroduced in Python 3.2, where they are reached through codecs.encode and codecs.decode rather than through the string methods.

In general, you should design your interfaces so that they accept either character strings or byte strings, not both. An interface that accepts both (for reasons other than backwards compatibility) is a code smell: it is hard to test, prone to bugs (what if someone passes UTF-16 bytes?) and has questionable semantics in the first place.

If you must have an interface that accepts both character and byte strings, you can check for the presence of the decode method in Python 3. If you want your code to work in 2.x as well, you'll have to use isinstance.
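A minimal sketch of the `isinstance` approach that should run on both 2.x and 3.x. The names `text_type` and `to_utf8` are hypothetical, and rejecting non-ASCII byte strings is an assumption based on the question's "all ASCII byte strings" requirement.

```python
import sys

# Pick the character-string type for the running interpreter.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only exists on Python 2)


def to_utf8(value):
    """Return value as UTF-8 encoded bytes (hypothetical helper)."""
    if isinstance(value, text_type):
        # Character string: encode it.
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # Byte string: only pure ASCII is accepted, since the real
        # encoding of arbitrary bytes is unknown. Raises
        # UnicodeDecodeError for anything else.
        value.decode("ascii")
        return value
    raise TypeError("expected a character or byte string, got %r" % type(value))
```

Calling `to_utf8(u"\xe9\xe9n")` returns the UTF-8 bytes, while `to_utf8(b"abc")` returns the input unchanged, because ASCII is a subset of UTF-8.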

phihag
  • Just what I thought. But then, how do I tackle the problem of going from any `basestring` to UTF-8, without `isinstance`? – Fred Foo Jul 21 '12 at 13:16
  • Updated the answer. That's a problem that shouldn't occur in the first place - you should know what you're getting passed. I'm afraid you'll have to use `isinstance` if you want Python 2 and 3 compatibility. – phihag Jul 21 '12 at 13:22
1

Conversion between str and unicode is not the only purpose of encode/decode. Python 2 also ships codecs that perform other transformations, such as hex.

For example (in Python 2):

>>> u'123'.encode('hex')
'313233'
>>> '313233'.decode('hex')
'123'
>>> u'313233'.decode('hex')
'123'

I'm not sufficiently familiar with Python 3 to be able to say whether or not this works there.
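For comparison, a rough sketch of how the same hex round trip might look in Python 3, where these conversions operate on bytes. It assumes Python 3.2 or later for the `codecs.encode` call; the variable names are made up for the example.

```python
import binascii
import codecs

raw = b"123"

# bytes -> hex digits (as bytes) and back, using binascii
hex_bytes = binascii.hexlify(raw)
round_trip = binascii.unhexlify(hex_bytes)

# The same conversion through the codec machinery (Python 3.2+)
via_codecs = codecs.encode(raw, "hex_codec")

# Parsing a hex *character* string into bytes
from_hex = bytes.fromhex("313233")
```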

Jeff Bradberry
  • Wow, +1 for the complete surprise. – Fred Foo Jul 21 '12 at 13:27
  • "hex" for string-to-string conversions is only available in 2.x; it [has been removed in 3.x](http://stackoverflow.com/questions/2340319/python-3-1-1-string-to-hex). – phihag Jul 21 '12 at 13:27
1
  1. The Unicode object has a decode() method because it inherits from basestring and basestring has one, so Unicode has to have one as well.

  2. You tackle the problem by never decoding Unicode strings, in Python 2 or Python 3. As you note, it makes no sense to do so. So don't.

How then do you handle this in a compatible way in Python 2 and Python 3? Well, you don't use strings for binary data; you use bytes. They have a decode() method that works in all versions of Python.
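A small sketch of that idea, assuming the incoming bytes are known to be UTF-8 (the `b""` literal requires Python 2.6 or later):

```python
# Binary data is kept as bytes; on Python 2, bytes is an alias for str.
data = b"\xc3\xa9\xc3\xa9n"

# Decode once at the boundary to get a character string...
text = data.decode("utf-8")

# ...and encode again only when bytes are needed for output.
encoded_again = text.encode("utf-8")
```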

For more information on this see http://python3porting.com/noconv.html and also http://regebro.wordpress.com/2011/03/23/unconfusing-unicode-what-is-unicode/

Lennart Regebro