
Short story:

Is Python 3 unicode string lookup O(1) or O(n)?

Long story:

Index lookup of a character in a C char array is constant time, O(1), because we can jump directly to the right offset within a contiguous block of memory:

const char* mystring = "abcdef";
char its_d = mystring[3];

It's the same as saying:

char its_d = *(mystring + 3);

Because we know that sizeof(char) is 1 as of C99, and because in ASCII one character fits in one byte.

Now, in Python 3, where string literals are unicode strings, we have the following:

>>> mystring = 'ab€cd'
>>> len(mystring)
5
>>> mybytes = mystring.encode('utf-8')
>>> len(mybytes)
7
>>> mybytes
b'ab\xe2\x82\xaccd'
>>> mystring[2]
'€'
>>> mybytes[2]
226
>>> ord(mystring[2])
8364

In the UTF-8 encoding, byte 2 is > 127, signalling that the third character uses a multibyte representation.

I cannot but conclude that an index lookup in a Python string CANNOT be O(1), because of the multibyte representation of characters. That would mean that mystring[2] is O(n), and that some on-the-fly interpretation of the memory array is being performed in order to find the character at that index. If that's the case, did I miss some relevant documentation stating this?

I made a very basic benchmark, but I cannot infer O(n) behaviour from it: https://gist.github.com/carlos-jenkins/e3084a07402ccc25dfd0038c9fe284b5

$ python3 lookups.py
Allocating memory...
Go!
String lookup: 0.513942 ms
Bytes lookup : 0.486462 ms
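For reference, a minimal version of such a benchmark can be sketched as follows (the string size, index, and iteration count here are arbitrary choices, not necessarily those from the gist):

```python
import timeit

# Build a large string containing a multibyte character ('€') and its
# UTF-8 encoding, then time index lookups deep into each object.
mystring = 'ab€cd' * 200_000           # 1,000,000 characters
mybytes = mystring.encode('utf-8')     # 1,400,000 bytes

idx = len(mystring) - 2                # an index near the end
str_time = timeit.timeit(lambda: mystring[idx], number=100_000)
byte_time = timeit.timeit(lambda: mybytes[idx], number=100_000)

# If string indexing were O(n), the string lookup would be dramatically
# slower than the bytes lookup at this size; in CPython it is not.
print(f'String lookup: {str_time:.6f} s')
print(f'Bytes lookup : {byte_time:.6f} s')
```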

EDIT: Updated with better example.

Havok

1 Answer


UTF-8 is the default source encoding for Python 3, but that is unrelated to how strings are stored. The internal representation of (Unicode) string objects uses fixed-size per-character elements in both Python 2 and Python 3; since Python 3.3 (PEP 393), CPython chooses the element width per string (1, 2, or 4 bytes per code point, depending on the widest code point the string contains). One of the results is that accessing characters in Python (Unicode) string objects by index has O(1) cost.

The code and results you presented do not demonstrate otherwise. You convert a string to a UTF-8-encoded byte sequence, and we all know that UTF-8 uses variable-length code sequences, but none of that says anything about the internal representation of the original string.
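The fixed-width internal representation can be observed empirically (a CPython-specific sketch; the measured sizes are implementation details of PEP 393, not guarantees of the language):

```python
import sys

def char_width(ch):
    # Measure the per-element storage cost empirically: growing a string
    # by one copy of `ch` adds exactly one element of its internal width.
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)

print(char_width('a'))           # ASCII (U+0061)         -> 1 byte/char
print(char_width('€'))           # BMP (U+20AC)           -> 2 bytes/char
print(char_width('\U0001F600'))  # astral plane (U+1F600) -> 4 bytes/char
```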

John Bollinger
  • I wrongly assumed that the internal representation of a unicode string was a UTF-8-encoded byte array. Thank you for the explanation! – Havok Dec 15 '16 at 21:10
  • How can the characters be a fixed size, if they can be arbitrarily long? E.g. "‍‍‍" is 5 code points (man + zero width joiner + woman + girl + boy). You can't fit that in a single 8, 16 or 32 bit element – Alexander Apr 09 '20 at 15:22
  • @Alexander-ReinstateMonica, Unicode characters are *not* arbitrarily long. Each one is exactly one character long, and each character corresponds to a number drawn from a 21-bit code space. Even variable-length encodings such as UTF-8 provide a *bounded*-length character representation. No valid UTF-8 encoding of a single Unicode character is longer than four bytes. I think you are failing to distinguish between individual characters and strings. – John Bollinger Apr 09 '20 at 18:04
  • @JohnBollinger By "character", I was referring to user-perceived characters, or what Unicode would call legacy grapheme clusters or extended grapheme clusters, such as that family Emoji above. It's a single character as far as a user is concerned, and a single extended grapheme cluster, composed of 5 separate code points (man, joiner, woman, girl, boy). It appears Python 3's string subscripting has the index referring to code points, not characters. So `"‍‍‍"[0]` gives `""`, not the full family character – Alexander Apr 09 '20 at 19:19
  • Fair enough, @Alexander-ReinstateMonica, but what you are referring to as a "character", is not what Unicode, or Python, or any of the other programming languages I work with mean by that term (though there is a diversity of definitions among those), nor is it what the OP was asking about here, nor, therefore, what I mean by that term in this answer and these comments. I would call a Python (or Java or C) representation of the multi-code-point unit in your example a "string", as I supposed. – John Bollinger Apr 09 '20 at 20:24
  • @JohnBollinger From my reading of it, I think OP was thinking of the same concept of "characters" that I was, which led me down this road. His whole suspicion that indexing might be `O(n)` is predicated on having non-fixed-width characters. If there's a fixed size, whether it be 8, 16, or 32 bits, it's still constant-time indexable. – Alexander Apr 12 '20 at 23:40
  • To clear the confusion, I'll refer to my understanding of characters (user-perceived characters, formally called "extended grapheme clusters") as EGCs. It seems that python is totally unaware of EGCs, which leads to the string API being wrong on EGCs, like Emojis, accented characters and such. It's totally broken, e.g. `len("naïve") != len("naïve")`: https://bugs.python.org/issue30717 I found the `grapheme` package, which seems to handle this properly: https://pypi.org/project/grapheme/ Do you know of any other built-in ways? – Alexander Apr 12 '20 at 23:50
  • I find this really strange, given how high-level a language Python is. Individual code points are like implementation details, and I have no idea why indexing strings would yield code points, which makes string manipulations easily produce invalid results. You can easily and unintentionally make a change that yields an invalid ordering, like removing the first letter of `"ïabc"`, which would leave a `0x0308` combining accent that doesn't follow a character (thus nothing to accent). – Alexander Apr 12 '20 at 23:53
  • The unicode scalars are only really useful in low-level text-encoding code, not for the user-level stuff that Python usually deals with. I was really surprised to see that not only are EGCs not the default subunit of strings, but there's no built-in way to deal with them correctly (that I found, at least). Also, it appears that Stack Overflow canonicalizes strings in comments, so my example didn't carry across. It was a string where one of the two variants was equivalent to `u"nai\u0308ve"` – Alexander Apr 12 '20 at 23:58
  • @Alexander-ReinstateMonica, whereas I recognize that extended grapheme clusters are a thing, I see absolutely no reason whatever to think that they are what the OP was asking about. The OP's concern was explicitly that retrieving a character by index from a Python Unicode string was inefficient *on account of UTF-8 being a variable-length encoding*. There is no reason to think that they meant anything different by the word "character" than what that word normally means in the jargon of Python. – John Bollinger Apr 13 '20 at 03:02
  • Moreover, although I appreciate that you have an interest in Unicode text processing in all its gory detail, I do not at all accept the assertion that "The unicode scalars are only really useful in low level text encoding code", or that Python ought to present a different view of Unicode strings at the language level. The fact is that full-blown Unicode text processing is baroque, and details such as extended grapheme clusters just are not relevant to most programmers' work. They are not the basic unit of Python strings because most programmers do not need or want to care about them. – John Bollinger Apr 13 '20 at 03:22
  • I think you're right about utf8 being the source of the length variability OP was concerned about, not EGCs – Alexander Apr 13 '20 at 13:00
  • It doesn't matter how the source code is encoded. That string instance could be a tweet fetched from an API, for example. The indexing defies what users think characters to be. When you go into the character picker to type an emoji, you don't peck away at the individual members of the family, the zero width joiner, etc. It's a single character as far as the user is concerned. To then say "the first character of this string starting with a family, is a single man on his own", would make no sense to any user, or any dev not familiar with unicode's impl details (most?) – Alexander Apr 13 '20 at 13:38
  • Anyhow, I don't think I'll be able to convince you of just how bizarre this behaviour is, I suspect it's a [curse of knowledge situation](https://en.wikipedia.org/wiki/Curse_of_knowledge). Internationalization is hard, and it's made harder by (IMO) bad defaults like this, that let people unknowingly splice EGCs to obtain invalidly sequenced scalars. Most dev prob don't even know what any of that means, let alone that they introduced a bug, and it's why so much international code is so broken – Alexander Apr 13 '20 at 13:40
  • @Alexander-ReinstateMonica, that it does not matter how the source code is encoded is precisely the point of this answer, so at least we can agree on that. We can also agree that a lot of programs, not just in Python, fail to handle input in a way that is sensible with respect to full Unicode text semantics. I daresay these include many of the text editors that programmers use to write code. I also agree that neither of us appears likely to persuade the other. – John Bollinger Apr 13 '20 at 19:19
  • However, that we disagree about the importance and severity of that problem, and especially that we disagree about how it should impact Python's or any other language's design, does not imply a curse of knowledge problem *in either direction*, and I frankly find it a bit arrogant of you to suggest otherwise. Well-informed people disagree on matters of opinion all the time. – John Bollinger Apr 13 '20 at 19:23
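The normalization pitfall raised in the comments can be reproduced with the standard library's `unicodedata` module (a small sketch; the combining-diaeresis example follows the `naïve` case from the linked bug report):

```python
import unicodedata

decomposed = 'nai\u0308ve'    # 'i' followed by U+0308 COMBINING DIAERESIS
precomposed = 'na\xefve'      # single precomposed 'ï' (U+00EF)

# Both render as "naïve", yet they differ in length and compare unequal,
# because Python strings are sequences of code points, not grapheme clusters.
print(len(decomposed), len(precomposed))   # 6 5
print(decomposed == precomposed)           # False

# NFC normalization maps both to the same code-point sequence.
nfc = unicodedata.normalize('NFC', decomposed)
print(nfc == precomposed)                  # True
```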