
I believe most of you who are familiar with Python have read Dive Into Python 3. In chapter 4.3, it says this:

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question.

I think I understand what this means: strings are sequences of characters from the Unicode set, and Python can encode those characters using different encodings. However, aren't characters in Python stored as bytes in the computer anyway? For example, with s = 'strings', s is surely stored in memory as a byte stream '0100100101...' or whatever. So which encoding is used there? Is it some "default" encoding of Python?

Thanks!

deceze
endless
  • Is there any other way to store _anything_ in anything else than bytes on a computer? – Kimvais Mar 15 '12 at 08:13
  • The same question is already asked: http://stackoverflow.com/questions/1838170/what-is-internal-representation-of-string-in-python-3-x – citxx Mar 15 '12 at 08:14

1 Answer


Python 3 distinguishes between text and binary data. Text is guaranteed to be Unicode, but the language does not prescribe an internal encoding, as far as I could see. So it could be UTF-8, or UTF-16, or UTF-32¹, and you wouldn't even notice.
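To illustrate why the internal encoding is invisible, here is a minimal sketch. The sample text and the choice of codecs are mine, purely for illustration. The same text string produces different byte sequences under different encodings, yet always decodes back to an equal string:

```python
# Sample text chosen for illustration; escapes keep the source file ASCII-only.
text = "na\u00efve \u20ac"  # 7 code points: n, a, i-diaeresis, v, e, space, euro sign

sizes = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(codec)          # text -> bytes in one specific encoding
    assert data.decode(codec) == text  # decoding restores the same text
    sizes[codec] = len(data)           # byte length depends on the encoding

# The length of the text itself is counted in code points, not bytes.
print(len(text), sizes)
```

The byte counts differ per encoding, while len(text) stays the same, because a str is not "in" any of those encodings until you call .encode().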

The main point here is: you shouldn't even care. If you want to deal with text, use text strings and access them by code point (the number assigned to a single Unicode character, which is independent of the internal UTF; that encoding may split one code point into several smaller code units). If you want bytes, use b"" and access them by byte. And if you want a string as a byte sequence in a specific encoding, use .encode().
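A short sketch of the difference between accessing by code point and accessing by byte. The sample string is my own and UTF-8 is just one possible target encoding:

```python
s = "h\u00e9llo"       # text: 5 code points, the second is e-acute (U+00E9)
b = s.encode("utf-8")  # bytes: the same text in one specific encoding

# A str is indexed by code point.
print(len(s), s[1], hex(ord(s[1])))

# A bytes object is indexed by byte; each item is an int from 0 to 255.
# In UTF-8, U+00E9 takes two bytes, so the byte string is one longer.
print(len(b), b[1], b[1:3])

# Decoding with the same encoding gives back the original text.
print(b.decode("utf-8") == s)
```

Note that s[1] is a one-character string, while b[1] is an integer; that is the text versus binary distinction in practice.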


¹ Or even UTF-9, if someone is insane enough to implement Python on a PDP-10.

Joey
  • I have read the following chapters and I understand now. I shouldn't even care. This is a good point, thanks. – endless Apr 01 '12 at 00:40