
I see that the Python manual mentions the .encode() and .decode() string methods. Playing around in the Python CLI, I see that I can create a unicode string u'hello' with a different datatype than a 'regular' string 'hello', and can convert / cast between them with str(). But the real problems start when using characters above ASCII 127, as in u'שלום', and I am having a hard time determining empirically exactly what is happening.

Stack Overflow is overflowing with examples of confusion regarding Python's unicode and string-encoding/decoding handling.

What exactly happens (how are the bytes changed, and how is the datatype changed) when encoding and decoding strings, especially when characters that cannot be represented in 7 bits are included in the string? Is it true, as it seems, that a Python variable with datatype <type 'str'> can be both encoded and decoded? If it is encoded, I understand that means that the string is represented as UTF-8, ISO-8859-1, or some other encoding, is this correct? If it is decoded, what does this mean? Are decoded strings unicode? If so, then why don't they have the datatype <type 'unicode'>?

In the interest of those who will read this later, I think that both Python 2 and Python 3 should be addressed. Thank you!

dotancohen

1 Answer


This is only the case in Python 2. The existence of a decode method on Python 2's str is a wart, which has been fixed in Python 3 (where the equivalent type, bytes, has only decode).

You can't 'encode' an already-encoded string. What actually happens when you call encode on a str is that Python first implicitly calls decode on it using the default encoding, which is usually ASCII. This is almost always not what you want. You should always call decode explicitly to convert a str to unicode before re-encoding it to a different encoding.
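A sketch of that pitfall, written in Python 3 syntax so it runs today (Python 3's bytes plays the role of Python 2's str, and the hidden decode step is spelled out explicitly):

```python
# Python 2's str is a byte string; Python 3's bytes is the closest analogue.
# In Python 2, raw.encode('utf-8') secretly runs something like
# raw.decode(sys.getdefaultencoding()).encode('utf-8'), and the default
# encoding is usually ASCII.
raw = u'שלום'.encode('utf-8')  # 8 bytes, every one of them above 127

# The hidden step fails: these bytes are not valid ASCII.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print('implicit decode fails:', type(exc).__name__)

# The explicit, correct order: bytes -> unicode text -> bytes.
text = raw.decode('utf-8')      # decode to text first
reencoded = text.encode('utf-16')  # now encoding to another codec is safe
```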

(And decoded strings are unicode, and they do have the datatype <type 'unicode'>, so I don't know what you mean by that question.)

In Python 3, of course, strings are unicode by default. You can only encode them to bytes, which, as I mention above, can only be decoded.
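A quick Python 3 sketch of that one-way relationship, using the same Hebrew example:

```python
# Python 3: str is Unicode text; bytes is encoded data.
s = 'שלום'
b = s.encode('utf-8')

print(type(s).__name__, len(s))  # str 4   (four code points)
print(type(b).__name__, len(b))  # bytes 8 (two UTF-8 bytes per character)
print(b.decode('utf-8') == s)    # True: decode is the inverse of encode

# The confusing Python 2 methods are simply gone:
print(hasattr(s, 'decode'))  # False -- text can only be encoded
print(hasattr(b, 'encode'))  # False -- bytes can only be decoded
```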

Daniel Roseman
  • Thank you Daniel. I think that I may be best off porting to Python 3 and being done with it. I find the implicit decoding done in Python to be not only very 'unpythonic' (explicit is better than implicit) but also very confusing as the developer is not aware that such a conversion has taken place. Plus, it is decoding using the wrong encoding! – dotancohen Jun 12 '13 at 11:06