I'm a PHP programmer who will be moving over to Python soon.

In the PHP world there are two functions for dealing with string slices: mb_substr and substr. The multi-byte variant, mb_substr, is notoriously slow because PHP doesn't know how many bytes each character occupies; it has to iterate over every character and check its length to find the byte position of a given offset.
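To illustrate, here is a rough Python sketch of the scan a byte-oriented multi-byte substr has to perform (the function name and details are mine, just for illustration):

def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th character of UTF-8 encoded bytes by scanning
    from the start: O(n) in the offset, like PHP's mb_substr."""
    def seq_len(lead: int) -> int:
        # The lead byte encodes the sequence length:
        # 0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    pos = 0
    for _ in range(index):
        pos += seq_len(data[pos])
    return data[pos:pos + seq_len(data[pos])].decode("utf-8")

print(utf8_char_at("aü字🙂".encode("utf-8"), 3))  # 🙂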

I wrote the following benchmark to see if Python (3.8) has the same issue:

from random import randrange
from time import time

alphabet = ["a", "ü", "字", ""]
alphabet_length = len(alphabet)

string_length = 10_000_000

time_g0 = time()
rand_char_list = list(map(lambda _: alphabet[randrange(0, alphabet_length)], range(string_length)))
time_g = time() - time_g0
print("Time to generate chars:", time_g)

time_j0 = time()
rand_string = ''.join(rand_char_list)
time_j = time() - time_j0
print("Time to join chars:", time_j)

offset_i = 0
offset_m = string_length // 2
offset_f = string_length - 1

time_i0 = time()
utf8_char = rand_string[offset_i]
time_i = time() - time_i0
print("Time to slice initial char:", time_i)

time_m0 = time()
utf8_char = rand_string[offset_m]
time_m = time() - time_m0
print("Time to slice middle char:", time_m)

time_f0 = time()
utf8_char = rand_string[offset_f]
time_f = time() - time_f0
print("Time to slice final char:", time_f)

Which outputs the following:

Time to generate chars: 5.610808849334717
Time to join chars: 0.21134543418884277
Time to slice initial char: 9.5367431640625e-07
Time to slice middle char: 4.76837158203125e-07
Time to slice final char: 4.76837158203125e-07

I've run the test multiple times and the results are fairly consistent.

I was surprised that the join operation took such a long time, while the slice operation is, strangely enough, very fast. Even more surprisingly, in 9 out of 10 runs slicing the initial char is marginally slower than the other two operations.

How can this be?

Does Python store the byte index of each character in a UTF-8 string? Does it use fixed-width (4-byte) characters?
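Aside: the individual slice timings above are near the resolution of time(), so averaging many slices with timeit (a sketch, appended to the script above so it can reuse rand_string and the offsets) would give steadier numbers:

from timeit import timeit

# time() can only resolve roughly microsecond steps, so a single slice is
# at the edge of what it can measure; timeit averages over many repetitions.
print("Initial:", timeit(lambda: rand_string[offset_i], number=1_000_000))
print("Middle: ", timeit(lambda: rand_string[offset_m], number=1_000_000))
print("Final:  ", timeit(lambda: rand_string[offset_f], number=1_000_000))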

kaan_a
    "Looking over the specifics of the language I noticed that all strings are utf8 by default." - that is not actually true. Python strings do not represent UTF-8, and they do not use a UTF-8 representation internally. (Some strings may cache a UTF-8 encoding to save repeated conversion work, but that's not the primary representation.) – user2357112 Nov 14 '21 at 17:23
    `rand_char_list` is not a list, but a map object, which is an iterator, so nothing gets done here. Use `list(map(...))` to actually build the list. `join` is faster when used on a list rather than a generator, see https://stackoverflow.com/questions/32462194/python-understanding-iterators-and-join-better – Thierry Lathuille Nov 14 '21 at 17:24
  • Oops, you are right, I was sure that I had the list call there. Join is still kind of slow though. I'll update the question in a minute – kaan_a Nov 14 '21 at 17:29

1 Answer

Python actually uses a flexible internal representation and picks the narrowest format that still allows random access. If every character fits in one byte (code points up to 255, i.e. the Latin-1 range), the string is stored as an array of 1-byte characters. If any character needs more, 2 bytes per character are used, and so on up to 4 bytes. The details are described here: https://www.python.org/dev/peps/pep-0393/
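One way to see this on CPython is to compare sys.getsizeof for strings built from characters of different widths; the per-character delta reflects the internal storage width, not the UTF-8 length (a quick sketch, exact overheads vary by build):

import sys

def bytes_per_char(ch: str, n: int = 10_000) -> float:
    # Two strings of the same character share the same fixed overhead, so
    # the size difference divided by the length difference gives the
    # internal storage width per character.
    return (sys.getsizeof(ch * n) - sys.getsizeof(ch)) / (n - 1)

for ch in "aü字🙂":
    print(ch, bytes_per_char(ch))

# Typically prints 1.0 for both "a" and "ü" (Latin-1 range), 2.0 for "字"
# and 4.0 for "🙂".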

This means that random access into a string should be approximately equally fast for all character sizes. Differences may still arise because certain optimizations do not apply to wide characters, or because conversions between the internal formats happen lazily. For example, all single-character strings in the Latin-1 range (much like small integer values) are not allocated dynamically but taken from a preallocated pool. This obviously cannot apply to the set of all 4-byte characters.
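The preallocated pool can be observed via identity checks (a CPython implementation detail, not guaranteed behaviour):

# Slicing out an ASCII/Latin-1 character returns an object from CPython's
# preallocated pool, so independent slices yield the very same object.
a = "spam"[1]
b = "price"[0]
print(a is b)  # True on CPython: 'p' comes from the cache

# A 4-byte character is allocated fresh on each slice instead.
u = "a🙂b"[1]
v = "c🙂d"[1]
print(u is v)  # typically False on CPython: no cache for '🙂'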

As far as string joining is concerned, its sometimes underwhelming performance is a topic of occasional discussion. See for example: https://lwn.net/Articles/816415/
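One detail worth knowing (also raised in the comments above): str.join needs the total length up front, so feeding it a generator forces CPython to collect the items into a sequence first anyway, typically making the list version faster. A rough comparison:

from timeit import timeit

chars = ["a"] * 1_000_000

# join over a list: the total length can be computed in one pass.
print(timeit(lambda: "".join(chars), number=10))

# join over a generator: the items are materialized into a sequence
# first, adding an extra pass and allocation.
print(timeit(lambda: "".join(c for c in chars), number=10))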

Homer512