I'm a PHP programmer who will be moving over to Python soon.
In the PHP world there are two functions for taking string slices: substr and mb_substr. The multi-byte variant, mb_substr, is notoriously slow because PHP doesn't know how many bytes each character occupies, so it has to iterate over every character and check its length to find the byte position of a given offset.
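Roughly, that scan looks like the following sketch (written in Python for comparison; utf8_char_to_byte_offset is just an illustrative name, not PHP's actual implementation, and it assumes valid UTF-8):

def utf8_char_to_byte_offset(data: bytes, char_index: int) -> int:
    # Each UTF-8 lead byte encodes the width of its character, so the only
    # way to find the n-th character is to hop from lead byte to lead byte.
    byte_pos = 0
    for _ in range(char_index):
        lead = data[byte_pos]
        if lead < 0x80:    # 0xxxxxxx: 1-byte (ASCII)
            byte_pos += 1
        elif lead < 0xE0:  # 110xxxxx: 2-byte sequence
            byte_pos += 2
        elif lead < 0xF0:  # 1110xxxx: 3-byte sequence
            byte_pos += 3
        else:              # 11110xxx: 4-byte sequence
            byte_pos += 4
    return byte_pos

# "aü字" is 1 + 2 + 3 bytes, so the character at offset 2 starts at byte 3.
assert utf8_char_to_byte_offset("aü字".encode("utf-8"), 2) == 3

That scan is why mb_substr costs time proportional to the offset.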
I wrote the following benchmark to see if Python (3.8) has the same issue:
from random import randrange
from time import time
alphabet = ["a", "ü", "字", "🙂"]  # 1-, 2-, 3- and 4-byte characters in UTF-8
alphabet_length = len(alphabet)
string_length = 10_000_000
time_g0 = time()
rand_char_list = list(map(lambda _: alphabet[randrange(0, alphabet_length)], range(string_length)))
time_g = time() - time_g0
print("Time to generate chars:", time_g)
time_j0 = time()
rand_string = ''.join(rand_char_list)
time_j = time() - time_j0
print("Time to join chars:", time_j)
offset_i = 0
offset_m = string_length // 2
offset_f = string_length - 1
time_i0 = time()
utf8_char = rand_string[offset_i]
time_i = time() - time_i0
print("Time to slice initial char:", time_i)
time_m0 = time()
utf8_char = rand_string[offset_m]
time_m = time() - time_m0
print("Time to slice middle char:", time_m)
time_f0 = time()
utf8_char = rand_string[offset_f]
time_f = time() - time_f0
print("Time to slice final char:", time_f)
This outputs the following:
Time to generate chars: 5.610808849334717
Time to join chars: 0.21134543418884277
Time to slice initial char: 9.5367431640625e-07
Time to slice middle char: 4.76837158203125e-07
Time to slice final char: 4.76837158203125e-07
I've run the test multiple times and the results are fairly consistent.
I was surprised that the join operation took so long, while the slice operations are, strangely enough, very fast. Even more surprisingly, in 9 out of 10 runs slicing the initial char is marginally slower than the other two operations.
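The per-slice times are also very close to the resolution of time.time(), so averaging over many repetitions with timeit should give more trustworthy numbers. A quick sketch (it assumes rand_string and the offsets from the benchmark above are still in scope):

from timeit import timeit

# Average each slice over a million repetitions to get past the timer's resolution.
for name, offset in [("initial", offset_i), ("middle", offset_m), ("final", offset_f)]:
    total = timeit(lambda: rand_string[offset], number=1_000_000)
    print(f"Time to slice {name} char: {total / 1_000_000:.3e}")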
How can this be?
Does Python store the byte indexes of the characters in a UTF-8 string? Or does it use a fixed-width (4-byte) representation per character?
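One way to probe the fixed-width idea would be to compare sys.getsizeof across strings whose widest character differs (a sketch; the exact per-character sizes depend on the CPython version and build):

import sys

# If CPython sizes a string by its widest character, the per-character
# cost should differ between these single-character alphabets.
for ch in ["a", "ü", "字", "🙂"]:
    s = ch * 1_000_000
    per_char = (sys.getsizeof(s) - sys.getsizeof(ch)) / (1_000_000 - 1)
    print(ch, f"{per_char:.2f} bytes per char")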