36

What exactly is a unicode string?

What's the difference between a regular string and unicode string?

What is utf-8?

I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?

i18n Strings (Unicode)

> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'

## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1'  ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8')             ## Convert bytes back to a unicode string
> t == ustring                      ## It's the same as the original, yay!
True

Files Unicode

import codecs

f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    pass  # here line is a *unicode* string
evanhutomo
Stevanus Iskandar

2 Answers

59

Update: Python 3

In Python 3, Unicode strings are the default. The str type is an immutable sequence of Unicode code points, and the bytes type is an immutable sequence of integers in the range 0 to 255 (often displayed as ASCII characters).
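As a rough illustration of that distinction, the sketch below compares a one-character str with its UTF-8 encoded bytes. The variable names are only illustrative.

```python
# Python 3: str holds code points, bytes holds small integers.
text = 'ñ'                     # one code point, U+00F1
data = text.encode('utf-8')    # the same character as UTF-8 bytes

text_len = len(text)           # len() on str counts code points
data_len = len(data)           # len() on bytes counts bytes
first_item = data[0]           # indexing bytes yields an int, not a character

print(type(text).__name__, text_len)
print(type(data).__name__, data_len, hex(first_item))
```

Indexing a str gives back a one-character str, while indexing a bytes object gives an int, which is one practical way to tell which type you are holding.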

Here is the code from the question, updated for Python 3:

>>> my_str = 'A unicode \u018e string \xf1' # no need for "u" prefix
# the escape sequence "\u" denotes a Unicode code point (in hex)
>>> my_str
'A unicode Ǝ string ñ'
# the Unicode code points U+018E and U+00F1 were displayed
# as their corresponding glyphs
>>> my_bytes = my_str.encode('utf-8') # convert to a bytes object
>>> my_bytes
b'A unicode \xc6\x8e string \xc3\xb1'
# the "b" prefix means a bytes literal
# the escape sequence "\x" denotes a byte using its hex value
# the code points U+018E and U+00F1 were encoded as 2-byte sequences
>>> my_str2 = my_bytes.decode('utf-8') # convert back to str
>>> my_str2 == my_str
True

Working with files:

>>> f = open('foo.txt', 'r') # text mode (Unicode)
>>> # the platform's default encoding (e.g. UTF-8) is used to decode the file
>>> # to set a specific encoding, use open('foo.txt', 'r', encoding="...")
>>> for line in f:
...     pass  # here line is a str object
...
>>> f = open('foo.txt', 'rb') # "b" means binary mode (bytes)
>>> for line in f:
...     pass  # here line is a bytes object

Historical answer: Python 2

In Python 2, the str type was a collection of 8-bit characters (like Python 3's bytes type). The English alphabet fits comfortably in 8-bit characters, but a single 8-bit character set has only 256 values, so it cannot also cover symbols such as Ω, и, ±, and ♠.

Unicode is a standard for working with a wide range of characters. Each symbol has a code point (a number), and these code points can be encoded (converted to a sequence of bytes) using a variety of encodings.
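To make the idea of a code point concrete, here is a small Python 3 sketch using the built-in ord() and chr() functions and the standard unicodedata module. The sample symbols are the ones mentioned earlier in this answer, plus a plain 'A' for comparison.

```python
import unicodedata

# Sample symbols mentioned earlier in this answer, plus ASCII 'A'.
symbols = ['A', 'Ω', 'и', '±', '♠']

# ord() maps a one-character str to its code point (an int).
code_points = {ch: ord(ch) for ch in symbols}

for ch, cp in code_points.items():
    # Show the conventional U+XXXX form and the official character name.
    print('U+%04X' % cp, unicodedata.name(ch))

# chr() goes the other way: code point -> one-character str.
omega = chr(0x03A9)
```

The code point is just a number; nothing here has been turned into bytes yet. That only happens when an encoding is applied.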

UTF-8 is one such encoding. The low code points are encoded using a single byte, and higher code points are encoded as sequences of bytes.
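The variable-length behaviour can be checked directly. This Python 3 sketch encodes a handful of characters (chosen for illustration) and records how many bytes each one needs in UTF-8.

```python
# A few characters with increasingly high code points.
# The last one is written as an escape: U+1F600, outside the 16-bit range.
samples = ['A', 'ñ', 'Ǝ', '♠', '\U0001F600']

# Encode each character on its own and count the resulting bytes.
utf8_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, n in utf8_lengths.items():
    print('U+%04X -> %d byte(s)' % (ord(ch), n))
```

Code points up to U+007F take one byte, so pure ASCII text is byte-for-byte identical in UTF-8; the higher code points here take two, three, and four bytes respectively.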

To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are outside the printable ASCII range.

The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each code point to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high code points and are encoded as a sequence of two bytes rather than a single byte.
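The 20-versus-22 count can be verified with len(). The sketch below is written for Python 3, where the result of encode() is a bytes object rather than a Python 2 str; the names n_chars, n_bytes, and multi_byte are illustrative.

```python
ustring = 'A unicode \u018e string \xf1'   # same text as in the question
s = ustring.encode('utf-8')

n_chars = len(ustring)   # number of code points
n_bytes = len(s)         # number of bytes after UTF-8 encoding

# Find the characters that need more than one byte in UTF-8.
multi_byte = [ch for ch in ustring if len(ch.encode('utf-8')) > 1]

print(n_chars, n_bytes, [hex(ord(ch)) for ch in multi_byte])
```

The two extra bytes come from the two characters collected in multi_byte, each of which contributes one byte more than its single-code-point count.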

When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.

The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original code points by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
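Decoding only reconstructs the original text if the bytes are interpreted with the encoding that produced them. The Python 3 sketch below (where bytes.decode() plays the role of Python 2's unicode(s, 'utf-8')) decodes the same bytes three ways; the latin-1 and ascii attempts are included only to show what a mismatch looks like.

```python
s = b'A unicode \xc6\x8e string \xc3\xb1'   # UTF-8 bytes from the question

# Correct encoding: the two-byte sequences become single code points again.
t = s.decode('utf-8')

# Wrong encoding: latin-1 maps every byte to one code point, so this
# raises no error but yields 22 characters instead of the original 20.
wrong = s.decode('latin-1')

# ASCII cannot represent bytes >= 0x80, so decoding fails outright.
try:
    s.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(len(t), len(wrong), type(ascii_error).__name__)
```

A silent mismatch like the latin-1 case is usually the harder one to notice, because the program keeps running with garbled text.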

The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
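In Python 3 the built-in open() accepts an encoding argument and can be used where the question uses codecs.open(). The following is a minimal sketch, assuming a temporary directory is acceptable for the demonstration; the file name foo.txt is taken from the question, and everything else is illustrative.

```python
import os
import tempfile

text = 'A unicode \u018e string \xf1'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'foo.txt')

    # Text mode with an explicit encoding: str in, UTF-8 bytes on disk.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Text mode again: bytes on disk are decoded back to str.
    with open(path, 'r', encoding='utf-8') as f:
        as_text = f.read()

    # Binary mode: no decoding, the raw bytes are returned.
    with open(path, 'rb') as f:
        as_bytes = f.read()

print(type(as_text).__name__, len(as_text))
print(type(as_bytes).__name__, len(as_bytes))
```

Passing encoding explicitly avoids depending on the platform default, which is the same reason the question's snippet names utf-8 in its codecs.open() call.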

tom
  • 2
    More specifically, the above is true for Python v2. In Python v3, Unicode strings are the default. – tripleee Feb 16 '14 at 10:49
  • thanks, ...but when will we be able to actually "see" those unicode characters? Will we kind of "inject" our python code into a system which is able to display those? – aderchox Apr 17 '19 at 05:20
  • 1
    Usually nowadays if you simply print a string to console output, or write it to a file which you then view in an editor, you will be able to see any non-ascii characters. Since utf8 is mostly backwards compatible with ascii anyway, most systems should now assume utf8 encoding by default. (For the same reason you should be able to save unicode characters directly into your .py file, and skip the escaped representations.) @aderchox – benjimin Jan 28 '20 at 03:29
-5

Python supports the string type and the unicode type. A string is a sequence of chars while a unicode is a sequence of "pointers". The unicode is an in-memory representation of the sequence and every symbol on it is not a char but a number (in hex format) intended to select a char in a map. So a unicode var does not have encoding because it does not contain chars.

  • 1
    You can have a detailed look into it on this blog http://www.carlosble.com/2010/12/understanding-python-and-unicode/ – Renjith Nair Feb 16 '14 at 07:55
  • 4
    -1 Not an accurate answer. Those are not "pointers" and both types are used to represent strings. – tripleee Feb 16 '14 at 08:18