8

What is the default encoding used for encoding strings in python 2.x? I've read that there are two possible ways to declare a string.

string = 'this is a string'
unicode_string = u'this is a unicode string'

The second string is in Unicode. What is the encoding of the first string?

jpp
Cortex
    Unicode is not an encoding. The encoding of the source file (which contains the strings) is usually the same for both strings. The first is stored internally as `unsigned char`. The second is stored as `UCS2` in Python 2 (so a Unicode code point can be represented by one or two Python `u` characters). In Python 3, it can be one of ASCII, UTF16 or UTF32 (selected dynamically), so just ignore how Python encodes characters internally. – Giacomo Catenazzi Apr 20 '18 at 12:33

4 Answers

10

As per Python default/implicit string encodings and conversions (reciting its Py2 part concisely, to minimize duplication):

There are actually multiple independent "default" string encodings in Python 2, used by different parts of its functionality.

  • Parsing the code and string literals:

    • str from a literal -- will contain raw bytes from the file, no transcoding is done
    • unicode from a literal -- the bytes from the file are decode'd with the file's "source encoding" which defaults to ascii
    • with unicode_literals future, all literals in the file are treated as Unicode literals
  • Transcoding/type conversion:

    • str<->unicode type conversion and encode/decode w/o arguments are done with sys.getdefaultencoding()
      • which is ascii almost always, so any national characters will cause a UnicodeError
    • str can only be decode'd and unicode -- encode'd. Trying otherwise will involve an implicit type conversion (with the aforementioned result)
  • I/O, including printing:

    • unicode -- encode'd with <file>.encoding if set, otherwise implicitly converted to str (with the aforementioned result)
    • str -- raw bytes are written to the stream, no transcoding is done. For national characters, a terminal will show different glyphs depending on its locale settings.
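The transcoding rule above can be sketched without a Python 2 interpreter. The snippet below is only an illustration, written so that it should behave the same on Python 2.7 and Python 3: it spells out explicitly the ASCII decode that Python 2 attempts implicitly when a str has to become a unicode. The variable names are made up for the example.

```python
# Illustration only: Python 2 performs this decode implicitly, using
# sys.getdefaultencoding() (normally 'ascii'), when a str meets a unicode.
raw = b"caf\xc3\xa9"          # UTF-8 bytes for u"caf\xe9", held in a byte string

try:
    raw.decode("ascii")       # what the implicit conversion would attempt
    failed = False
except UnicodeDecodeError:
    failed = True             # byte 0xc3 is outside the ASCII range

text = raw.decode("utf-8")    # an explicit decode with the right codec works
print(failed, text == u"caf\xe9")
```

Naming the codec explicitly, as in the last decode, is what avoids the UnicodeError described in the list.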
ivan_pozdeev
  • The transcoding section also has an inaccuracy: Python 2 allows binary transforms where there isn't an implicit conversion, for example the 'base64' codec and the 'unicode-escape' codec. Users can also register custom codecs. In Python 2 there is no such restriction as "str can only be decoded and unicode encoded"; it's just a convention. – wim Apr 24 '18 at 15:54
  • @wim `base64` et al. have nothing to do with _default_ (implicit) encodings, so they are off topic. The `decode`'d/`encode`'d explanation is clear enough IMO: the following sentence states right away that "Python 2 makes it look like you can, but this is just an illusion, actually you cannot" -- which would instantly clear any confusion that might result from the previous sentence, before it manages to take roots. – ivan_pozdeev Apr 24 '18 at 17:16
  • Nothing in the claim said it's restricted to implicit encoding/decoding (and, further, the idea that this question is even asking about implicit encodes/decodes seems imagined by you in the first place - I don't see that in the question!). So the statement as written is still plainly wrong. It's simply not like Python 3 where `str.decode` and `bytes.encode` are `AttributeError`. – wim Apr 24 '18 at 17:56
  • More in line with reality: both `unicode` and `str` types have their own `encode` and `decode` methods. It's up to the codec to decide what to do - using `unicode.decode`/`str.encode` *may* (or may not) cause an implicit type conversion. – wim Apr 24 '18 at 18:06
6

The literal answer is that they do not necessarily represent any particular encoding. In Python 2, a string is just an array of bytes, exactly like the bytes type in Python 3. For a string s you can call s.decode() to get a Unicode string, but you must* pass the encoding manually for exactly that reason. You could use a string to hold ASCII bytes, or characters from Windows code page 850 (which is a superset of ASCII), or UTF8 bytes, or even UTF16 bytes. The last case is interesting because even if the characters in that string are in the ASCII range, the bytes do not match the ASCII-encoded version (they will alternate with the null character). The string type is even suitable for bytes of some binary format that do not correspond to any encoded string e.g. the bytes of an image file.

A more practical answer is that often ASCII is assumed. For example, the literal string "xyz" will give a three byte string with the bytes corresponding to the ASCII encoding of those characters.
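A small sketch of the points above, assuming nothing beyond the standard codecs; it is meant to run on both Python 2.7 and Python 3, and the variable names are illustrative. The same byte means different characters under different encodings, and ASCII-range text encoded as UTF-16 does not match its ASCII bytes.

```python
# One byte, two readings: the byte string itself carries no encoding.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")   # U+00E9, e with an acute accent
as_cp850 = raw.decode("cp850")      # a different character in code page 850

# ASCII-range text encoded as UTF-16 (little-endian, no BOM): every ASCII
# byte is followed by a null byte, so the result differs from plain ASCII.
utf16_bytes = u"xyz".encode("utf-16-le")
ascii_bytes = u"xyz".encode("ascii")

print(as_latin1 != as_cp850)
print(repr(utf16_bytes), repr(ascii_bytes))
```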

This ambiguity is the reason for the change in behaviours and conventions around strings in Python 3.

* As noted in CristiFati's answer, it is possible to omit the encoding= argument to decode, in which case ASCII will be assumed. My mistake.

Arthur Tacca
  • Python 2 uses UCS2, not UTF16, so sometimes one Unicode character is represented by 2 Python (UCS2) characters. (So a u-string could have `len` 2 in Python 2 and 1 in Python 3.) – Giacomo Catenazzi Apr 20 '18 at 12:36
  • @GiacomoCatenazzi You are talking about unicode strings (`unicode` in Python 2, `str` in Python 3). This discussion is entirely about byte strings (`str` in Python 2, `bytes` in Python 3). If I want to put UTF16 bytes into one of those, I can do. I'm not sure whether Python's `.decode(encoding=...)` method supports it, but even if not I can still use other techniques to get such a sequence of bytes. – Arthur Tacca Apr 20 '18 at 14:49
3

As @ArthurTacca explained in his answer, a string ("this is a string") is just an array of bytes (0x74 0x68 0x69 0x73 0x20 0x69 0x73 0x20 0x61 0x20 0x73 0x74 0x72 0x69 0x6e 0x67), and its encoding makes no sense outside decoding context (when the bytes are interpreted).
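To see those bytes for yourself, a minimal sketch (it uses `bytearray` so that iteration yields integers on both Python 2.7 and Python 3; the names `data`, `hex_view` and `decoded` are illustrative):

```python
data = b"this is a string"

# Iterating a bytearray yields integers on both Python 2 and Python 3.
hex_view = " ".join("0x%02x" % byte for byte in bytearray(data))
print(hex_view)

# An encoding only enters the picture once the bytes are interpreted.
decoded = data.decode("ascii")
print(decoded)
```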

Check out [Python 2.Docs]: sys.getdefaultencoding().

>>> import sys
>>> sys.version
'2.7.10 (default, Mar  8 2016, 15:02:46) [MSC v.1600 64 bit (AMD64)]'
>>> sys.getdefaultencoding()
'ascii'
>>> "\xff".decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
CristiFati
3

The first string does not have an encoding. It is raw bytes. A convincing way to prove this to yourself is to change the encoding used to decode the source code to something else, using the coding declaration. This way you can visibly tell the difference between ASCII and bytes.

Save this to a .py file and execute it:

# coding: rot13

s0 =  "this is a string"
s1 = o"this is a string"
s2 = h"guvf vf n fgevat"

nffreg s0 == s1 == s2
cevag s0
cevag s1
cevag s2

This source is encoded in a simple letter substitution cipher. Letters in a-z A-Z are "rotated" by 13 places, other characters are unchanged. Since there are 26 letters in the alphabet, rotating twice is an identity transform. Note that the coding declaration itself is not rotated, see PEP 263 if you want to understand why.

  • nffreg is an assert statement, saying that these three strings all compare equal.
  • cevag is a print statement.
  • s2 is a unicode string with rotated u prefix. The other two are bytestrings.
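If you would rather check the rotation than take it on trust, the standard library's `rot_13` codec can reproduce it. This is a side sketch, not part of the file above, and the variable names are illustrative:

```python
import codecs

plain = u"this is a string"

# Rotate every letter by 13 places; spaces and punctuation are unchanged.
rotated = codecs.encode(plain, "rot_13")
print(rotated)

# Rotating a second time gives back the original text.
round_trip = codecs.encode(rotated, "rot_13")
print(round_trip)

# The rotated keywords used in the source file.
print(codecs.encode(u"assert", "rot_13"), codecs.encode(u"print", "rot_13"))
```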

Now, let's change the handling of the first string, by introducing the unicode literals __future__ import. Note that this future statement itself must be rotated, or you'll get a syntax error. This alters the way the tokenizer/compiler combo will process the first object, as will become evident:

# coding: rot13
sebz __shgher__ vzcbeg havpbqr_yvgrenyf

s0 =  "guvf vf n fgevat"
s1 = o"this is a string"
s2 = h"guvf vf n fgevat"

nffreg s0 == s1 == s2
cevag s0
cevag s1
cevag s2

We needed to change the text from this is a string into guvf vf n fgevat in order for the assert statement to remain valid. This shows that the first string does not have an encoding.

wim
  • Hi @wim, thanks for your answer. At the s1 = o"this is a string" line I get a SyntaxError: invalid syntax. Do you know what may have caused this? – Cortex Apr 23 '18 at 15:52
  • If you copy the code exactly as I have written it (including the comment at the top) and execute in a Python 2.7 interpreter, there is no syntax error. – wim Apr 23 '18 at 15:57
  • I had a problem running it in "IPython 5.4.1", but I managed to run it as a Python file. – Cortex Apr 23 '18 at 16:08