3
''' Set up '''
s= open("Bilion_of_UTF-8_chars.txt",encoding="UTF-8").read()

'''
The following doesn't look like a cheap operation
because Python3 `str`-s are UTF-8 encoded (EDIT: in some implementations only).
'''
my_char= s[453_452_345]

However, many people write loops like this:

for i in range(len(s)):
    do_something_with(s[i])

using indexing operation up to n times or more.

How does Python3 resolve the problem of indexing UTF-8 characters in strings for both code snippets?

  • Does it always perform a linear look-up for nth char (which is both simple & expensive resolution)?
  • Or maybe it stores some additional C pointers to perform smart index calculations?
Sir
  • 337
  • 1
  • 7

1 Answers1

5

What is Python 3 str.__getitem__ computional complexity?

A: O(1)

Python strings are not utf-8 internally: in Python 3 when getting text from any external source, the text is decoded according to a given codec. This text decoding defaults to utf-8 in most sources/platforms, but varying accordingly to the S.O.'s default - anyway, all relevant "text importing" APIs, like opening a file, or connecting to a DB, allow you to specify the text encoding to use.

Inner strings use one of "Latin-1", "UCS-2" or "UCS-4" according to the needs of the "widest" codepoint in the text string.

This is new from Python 3.3 onwards (prior to that, all internal string representation would default to 32bit UCS-4, even for ASCII-only text). The spec is documented on PEP-393.

Therefore, Python can just zero-in the correct character given its index.

As an anecdote, Luciano Ramalho (author of Fluent Python book), wrote Leanstr, a learning-purpose implementation of a string class that will hold utf-8 internally. Of course, then your worries about __getitem__ complexity apply: https://github.com/ramalho/leanstr

Unfortunatelly, (or fortunatelly, in this case), a lot of the standard library and native code extensions to Python will not accept a class similar to str, even if it inherits from str and keeps its data separetely, re-implementing all dunder methods. But if all str methods are in place, any pure-python code dealing with strings should accept a LeanStr instance.

Other implementations: Pypy

So, it happens that how text is used internally is an "implementation detail", and Pypy from version 7.1 onwards does use utf-8 byte strings internally for its text objects.

Unlike Ramalho's naive "leanstr" above, however, they do keep an index for each 4th utf-8 char so that char access by index can still be made in O(1). I did not find any docs about it, but the code for creating the index is here.

I've mentioned this question on twiter, as I am an acquittance of Ramalho, and eventually Carl Friederich Bolz-Terich, one of Pypy developers, reached back:

It's worked really quite well for us! Most Unicode strings don't need this index, and zero copy utf-8 decoding is quite cool. What's most annoying is actually str.find, because there you need the reverse conversion, from byte index to char index. we don't have an index for that.

Tweet

jsbueno
  • 99,910
  • 10
  • 151
  • 209
  • Python or CPython? I've read that PyPy does use utf-8 internally. – Kelly Bundy May 28 '22 at 01:35
  • While the PEP does not specify this to be Python wide, I am quite confident modern Pypy can't use internal UTF-8 text data, for the reasons in the question: it intends to be faster implementation, and it would be though to achieve that with one of the most common operations in tight loops going from O(1) to O(N) – jsbueno May 28 '22 at 02:10
  • Maybe you are mistaking "utf-8" for "unicode"? – jsbueno May 28 '22 at 02:18
  • They say [utf-8](https://doc.pypy.org/en/latest/release-v7.1.0.html). – Kelly Bundy May 28 '22 at 02:39
  • so, no. Maybe it does not do the item-size scaling thing as described in this pep, but it will certainly use a fixed-bytesize per codepoint. – jsbueno May 28 '22 at 02:41
  • and as far as we are talking about other Python implementations: it will be easier to have an implementation that does not do anything other than 0-255 latin1 characters than one that uses a varying width internal representation for text-data, such as utf-8. It is too key an aspect of how the language works. – jsbueno May 28 '22 at 02:44
  • UTF-8 isn't fixed-size, though, and Armin Rigo [wrote](https://www.pypy.org/posts/2019/03/pypy-v71-released-now-uses-utf-8-451324088028792912.html#utterances-thread) "The new PyPy is very different: it uses the UTF8 string *only*". – Kelly Bundy May 28 '22 at 02:52
  • Yes - I've seem the 7.1 release document, but the "English" docs only go so far. The code itself seems to support both modes - maybe it switches deepending on running "rpython" layer code and Python layer code. https://foss.heptapod.net/pypy/pypy/-/blob/branch/default/rpython/rlib/rstring.py#L19 – jsbueno May 28 '22 at 02:54
  • found the code relating to it - I am updatign the answer now. Thanks for your insistence. – jsbueno May 28 '22 at 03:01
  • 1
    Thanks. Best I found outside their code (which I'm not familiar with) is their ["Unicode strings are now internally represented as utf-8 in PyPy, with an optional extra index data structure to make indexing O(1). We'll write a blog post about it eventually."](https://mobile.twitter.com/pypyproject/status/1095971192513708032). – Kelly Bundy May 28 '22 at 16:49