How can a string be stored as a 'sequence of Unicode code points'?

Question

I am trying to understand string representation in python 3. I've seen various explanations on the site, and from the book learning python by Mark Lutz that in python 3, str objects are stored as Unicode code points. Quoting the book, "non-Unicode code sequences are sequences of 8 bit bytes that print with ASCII characters when possible, and Unicode strings are sequences of Unicode code points".

I understand the first part of the quote above, but I don't quite understand the second. How can a sequence of characters, such as when I type S = 'spam' into the console, be stored as 'Unicode code points'?

I believe code points are just numbers that correspond to characters, however the actual encoding that takes you from this number to binary representation depends on the system you choose to use, such as utf-8 or utf-32 . If this is true (please correct me if its not!), then in order for my variable S to be saved to memory, the computer must at some point convert 'spam' into some sequence of bytes. So I go from some characters to a binary for, which is a form of encoding? I have seen another post where it was explained that python does not do its own encoding.

I don't understand then how my variable S could be saved to memory without undergoing some form of encoding (not just storing the data as code points as the book explains)?

Thanks in advance.

In memory, the unicode code point is saved, which is a positive number. You can find out that number by using `ord()` on any unicode char. The memory basicly holds a binary representation of that number — Ralf, Oct 19 '18 at 18:25
related: [How is unicode represented internally in Python?](https://stackoverflow.com/questions/26079392/how-is-unicode-represented-internally-in-python) , [Why ord gives big endianness result while my platform is little endianness?](https://stackoverflow.com/questions/71062802/why-ord-gives-big-endianness-result-while-my-platform-is-little-endianness) — Rick, Feb 10 '22 at 10:46

user2357112 · Accepted Answer · 2018-10-19T19:08:23.583

2

Your quote doesn't say anything about the in-memory representation of a Unicode string. It says "Unicode strings are sequences of Unicode code points", not "are stored as".

This quote is a description of the meaning of a Unicode string, not its in-memory representation. Python has a lot of ways of representing Unicode strings internally, including ASCII, UTF-8, and UTF-32. It can even have multiple representations stored in the same string object; particularly, PyUnicode_AsUTF8AndSize will cause a string to store an auxiliary UTF-8 representation unless the string is ASCII (which is already valid UTF-8), and a string may also have a wchar_t representation stored.

All memory representations are implementation details and subject to change. If you want to see the internal representation, take a look at Include/unicodeobject.h

edited Oct 19 '18 at 19:08

answered Oct 19 '18 at 18:33

user2357112

260,549
28
431
505

Does this mean when I type S = 'spam' python is using a particular encoding to save the character sequence in memory? – masiewpao Oct 19 '18 at 18:37
1

@masiewpao: For that string, in current CPython, probably ASCII. – user2357112 Oct 19 '18 at 18:41
1

Python (at least CPython) does not use UTF-8 internally for strings. – Dietrich Epp Oct 19 '18 at 18:43
@user2357112 Also sorry for double commenting, but I don't understand why 'spam' is equivalent to '\u0073\u0070\u0061\u006D'. This implies that 's' is equivalent to '\u0073' which is just the code point, and similarly for the other characters. But as per this article, http://kunststube.net/encoding/ , the code point itself is not an encoding. So somehow the text 'spam' is being expressed as numbers (code points) but then how do those code points get saved in memory without encoding? – masiewpao Oct 19 '18 at 18:43
@user2357112 Ah I typed my previous comment before seeing your reply. Thank you! – masiewpao Oct 19 '18 at 18:46
@DietrichEpp: It does, never as the primary representation, but as an alternate representation. I'm referring to how functions like [PyUnicode_AsUTF8AndSize](https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AsUTF8AndSize) will generate a UTF-8 representation of a string and cache it on the string object. – user2357112 Oct 19 '18 at 18:53
Granted, but that is still a cache and the encoding of the string will still be either ASCII, LATIN1, UTF16 or UTF32. – Dietrich Epp Oct 19 '18 at 19:01

How can a string be stored as a 'sequence of Unicode code points'?

1 Answers1