What is Python 3 str.__getitem__
computional complexity?
A: O(1)
Python strings are not utf-8 internally: in Python 3 when getting text from any external source, the text is decoded according to a given codec. This text decoding defaults to utf-8 in most sources/platforms, but varying accordingly to the S.O.'s default - anyway, all relevant "text importing" APIs, like opening a file, or connecting to a DB, allow you to specify the text encoding to use.
Inner strings use one of "Latin-1", "UCS-2" or "UCS-4" according to the needs of the "widest" codepoint in the text string.
This is new from Python 3.3 onwards (prior to that, all internal string representation would default to 32bit UCS-4, even for ASCII-only text). The spec is documented on PEP-393.
Therefore, Python can just zero-in the correct character given its index.
As an anecdote, Luciano Ramalho (author of Fluent Python book), wrote Leanstr
, a learning-purpose implementation of a string class that will hold utf-8 internally. Of course, then your worries about __getitem__
complexity apply: https://github.com/ramalho/leanstr
Unfortunatelly, (or fortunatelly, in this case), a lot of the standard library and native code extensions to Python will not accept a class similar to str
, even if it inherits from str
and keeps its data separetely, re-implementing all dunder methods. But if all str methods are in place, any pure-python code dealing with strings should accept a LeanStr
instance.
Other implementations: Pypy
So, it happens that how text is used internally is an "implementation detail", and Pypy from version 7.1 onwards does use utf-8 byte strings internally for its text objects.
Unlike Ramalho's naive "leanstr" above, however, they do keep an index for each 4th utf-8 char so that char access by index can still be made in O(1). I did not find any docs about it, but the code for creating the index is here.
I've mentioned this question on twiter, as I am an acquittance of Ramalho, and eventually Carl Friederich Bolz-Terich, one of Pypy developers, reached back:
It's worked really quite well for us! Most Unicode strings don't need this index, and zero copy utf-8 decoding is quite cool. What's most annoying is actually str.find, because there you need the reverse conversion, from byte index to char index. we don't have an index for that.
Tweet