
This is my environment:

$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()
$ lscpu | grep "Model name"
Model name:                      Intel(R) Core(TM) i5-4430 CPU @ 3.00GHz
$ sudo dmidecode --type memory | grep -E "^\\s+(Speed|Type):" -
    Type: DDR3
    Speed: 1600 MT/s
    Type: DDR3
    Speed: 1600 MT/s

I consistently get timing results like these on my machine:

$ python -m timeit -s "x = list(range(250)) * 4" "[*x]"
200000 loops, best of 5: 1.54 usec per loop
$ python -m timeit -s "x = [0, 1, 2, 3] * 250" "[*x]"
200000 loops, best of 5: 1.48 usec per loop
$ python -m timeit -s "x = [0, 1] * 500" "[*x]"
100000 loops, best of 5: 2 usec per loop
$ python -m timeit -s "x = [0] * 1000" "[*x]"
100000 loops, best of 5: 3.84 usec per loop

(Here I have deliberately avoided using integers larger than 255, because I know about the caching of small integers.)

Similarly, with single-character strings:

$ python -m timeit -s "x = [chr(i) for i in range(250)] * 4" "[*x]"
200000 loops, best of 5: 1.81 usec per loop
$ python -m timeit -s "x = [chr(0), chr(1), chr(2), chr(3)] * 250" "[*x]"
200000 loops, best of 5: 1.5 usec per loop
$ python -m timeit -s "x = [chr(0), chr(1)] * 500" "[*x]"
100000 loops, best of 5: 2.03 usec per loop
$ python -m timeit -s "x = [chr(0)] * 1000" "[*x]"
100000 loops, best of 5: 3.83 usec per loop

(Again, I deliberately limit the character code points so that Python's flexible string representation will choose a consistent, single-byte-per-character representation.)

All of these input lists have length 1000, and the same method is used to copy them, so I wouldn't expect dramatic differences in the timings. However, I find that lists consisting of the same simple element (a single-character string or a small integer) repeated over and over take nearly twice as long to copy as anything else; that using a few different values is even faster than using two alternating values; and that using a wide variety of values is slightly slower again.

With slower copying methods, the difference is still present, but less dramatic:

$ python -m timeit -s "x = [chr(i) for i in range(250)] * 4" "[i for i in x]"
20000 loops, best of 5: 18.2 usec per loop
$ python -m timeit -s "x = [chr(0), chr(1), chr(2), chr(3)] * 250" "[i for i in x]"
20000 loops, best of 5: 17.3 usec per loop
$ python -m timeit -s "x = [chr(0), chr(1)] * 500" "[i for i in x]"
20000 loops, best of 5: 17.9 usec per loop
$ python -m timeit -s "x = [chr(0)] * 1000" "[i for i in x]"
20000 loops, best of 5: 19.1 usec per loop

Why does this occur?

Karl Knechtel
  • This is probably very similar to [this question](https://stackoverflow.com/questions/75597276/why-are-unordered-python-dictionary-lookups-10x-slower-than-ordered-lookups) – Barmar Mar 08 '23 at 20:16
  • I expect that it is the same sort of issue, but I intuitively expected that effects like that would work oppositely to how they apparently actually do. – Karl Knechtel Mar 08 '23 at 20:17
  • 1
    Interestingly, they are all equally slow if the resulting list consists only of 0, rather than monotonically increasing subsequences of varying length. Compare, for example, `x = [0,0]*500` and `x = [0, 1]*500`. – chepner Mar 08 '23 at 20:33
  • @chepner well, yes; I would expect a list of 1000 zeros to take the same time to copy regardless of how it was created, especially considering that it's the same object representing zero in every position in each list. I'm not timing the list construction here. Would it be clearer if I add `--`s to the command lines? – Karl Knechtel Mar 08 '23 at 20:41
  • @KarlKnechtel I believe @chepner is saying that a list of all zeros performs badly no matter how it gets created, so `x = [0, 0, 0, 0] * 250` performs as bad as `x = [0] * 1000` when doing `[*x]`. – Steven Rumbalski Mar 08 '23 at 20:48
  • 1
    You might want to try timing the list creating. It almost seems like the mono-element list isn't actually being stored as 1000 separate items, but rather a single item with a count. (Can't test further now.) – chepner Mar 08 '23 at 20:48
  • @StevenRumbalski Yes. I don't find that surprising, because "a list of all zeros" is a list of all the same actual Python `int` object representing the numeric quantity zero, so I *a priori* don't expect it to matter how it was created. There isn't, for example, a cache locality issue created by the order of different zero objects in memory, because there aren't separate objects. – Karl Knechtel Mar 08 '23 at 20:49
  • "It almost seems like the mono-element list isn't actually being stored as 1000 separate items, but rather a single item with a count." That clearly isn't the case, as demonstrated by the fact that `x = [0] * 1000000000` causes my system to allocate approximately 8GB of memory and thus thrash virtual memory. – Karl Knechtel Mar 08 '23 at 20:53
  • 1
    Another variation: `[*x]` is slower for this setup`x = [0, 1] * 500;x.sort()` when compared to `x = [0, 1] * 500`. So the list with alternating items is faster than one where they are all together. – Steven Rumbalski Mar 08 '23 at 21:01
  • @StevenRumbalski yes, it seems that it has to do with the same object appearing in consecutive positions in the input. However, I would have expected that to make things faster, not slower. (Incidentally, are the two of you reproducing the issue on other versions of Python?) – Karl Knechtel Mar 08 '23 at 21:11
  • 1
    reproduced with 3.10.8 – JonSG Mar 08 '23 at 21:23
  • I tried making `x` a tuple and the slowdown was the same. `collections.deque` shows a similar (but smaller) performance variance. – Steven Rumbalski Mar 08 '23 at 21:43
  • 2
    My best guess is that this is due to the repeated read-modify-writes to a single refcount in a tight loop inhibiting the CPU's ability to pipeline the operation. – user2357112 Mar 08 '23 at 21:50
  • 1
    I've been using 3.9, but confirmed in 3.11 as well. I started playing with lists created with `x = ([0]*m + [1]*m + [0]*m + [1]*m) * n`, with `4*m*n == 1000`. For small `n`, it stays slow, but gets faster as `m` decreases and `n` increase. The more "alternating" the list gets, the faster `[*x]` gets. – chepner Mar 08 '23 at 21:57
  • That makes sense. Every time the next element is a different object from the one before it, the refcount update for that element is on a different memory location than the previous element, removing the data dependency on the previous refcount update. – user2357112 Mar 08 '23 at 22:13
  • @user2357112 you make a compelling case, but is there any good way to verify it? – Karl Knechtel Mar 08 '23 at 22:18
  • For full confidence, you'd probably have to run Python with a profiler (a C code profiler, not a Python code profiler) and see how much time is spent on the refcount updates in `list_extend` and `list_dealloc`. – user2357112 Mar 08 '23 at 22:23
  • This is the reason why I sometimes use `[0,1,2,3] * n` when I want a fast iterable for benchmarking. [Previous experiment](https://stackoverflow.com/a/42110976/12671057) showing this. – Kelly Bundy Mar 08 '23 at 22:34

1 Answer


TL;DR: the performance gap is due to a complex low-level effect related to how modern processors execute instructions. More precisely, the performance depends on the number of store-forwarding operations that can be executed in parallel, which itself depends on the number of distinct objects in the target list: having more objects results in more independent instructions, hence more instructions that can be executed in parallel and a shorter critical path in the dependency chain of the instructions to be executed.


To understand what is going on, we first need to analyse the assembly code executed by the processor. We also need to understand quite well how modern processors work. Finally, a deep analysis can help us untangle what is precisely going on.


Configuration & reproducibility

On my machine, I am able to reproduce the problem on Windows 10 using CPython 3.8.1, an i5-9600KF processor and two RAM DIMMs @ 3200 MHz (CL16). Here are the timings:

1.35 µs ± 51.7 ns per loop (mean ± std. dev. of 1000 runs, 2000 loops each)
1.24 µs ± 50.6 ns per loop (mean ± std. dev. of 1000 runs, 2000 loops each)
1.41 µs ± 54 ns per loop (mean ± std. dev. of 1000 runs, 2000 loops each)
2.49 µs ± 89.8 ns per loop (mean ± std. dev. of 1000 runs, 1000 loops each)

High-level profiling & executed assembly code

In all cases, two functions take most of the computing time. Each of them contains a hot loop that accounts for almost all of the execution time of the benchmark. Here is the assembly code of the slowest one (~53% of the time):

0x180053f50  Block 10:
0x180053f50      mov rcx, qword ptr [rdx+rax*1]
0x180053f54      lea rax, ptr [rax+0x8]
0x180053f58      inc rbx
0x180053f5b      inc qword ptr [rcx]
0x180053f5e      mov qword ptr [rax-0x8], rcx
0x180053f62      cmp rbx, rdi
0x180053f65      jl 0x180053f50 <Block 10>

For those who are not very familiar with assembly language, here is C-like code doing a similar operation:

uint64_t rdx = ...;
uint64_t rax = ...;

for(uint64_t rbx=0 ; rbx<rdi ; ++rbx)
{
    uint64_t* rcx = *(uint64_t**)(rdx+rax);  // Read an array item (list reference?)
    rax += 8;
    (*rcx)++;                                // Increment the reference count of the object?
    *(uint64_t**)(rax-8) = rcx;              // Write the item into another array (list reference?)
}

And here is the assembly code of the faster one (~37% of the time):

0x180058691  Block 11:                   
0x180058691      mov rbx, qword ptr [rsi+0x18]
0x180058695      mov rbx, qword ptr [rbx+rdi*8]
0x180058699      test rbx, rbx
0x18005869c      jz 0x1800586a8 <Block 13>

0x18005869e  Block 12:
0x18005869e      sub qword ptr [rbx], 0x1
0x1800586a2      jz 0x180058764 <Block 26>       (almost no jump: block returning
                                                  the result of the function)

0x1800586a8  Block 13:
0x1800586a8      sub rdi, 0x1
0x1800586ac      jns 0x180058691 <Block 11>
             [...]

Again, here is similar C-like code to better understand what the hot loop actually does:

for(uint64_t rdi=... ; rdi>0 ; --rdi)
{
    uint64_t** tmp = *(uint64_t***)(rsi + 0x18);
    uint64_t* rbx = tmp[rdi];

    if(rbx != NULL)
    {
        // rbx is certainly a pointer to an object and
        // *rbx is certainly its reference count.
        // If there are no references left on the object,
        // then we (temporarily?) stop the loop
        // (a very rare case in practice).
        if(--(*rbx) == 0)
            goto Block26;
    }
}

Block26:
    // Make a few checks and call a function which
    // certainly deallocates the object.
    // Can jump back into the loop or stop the function,
    // depending on a complex set of conditional branches.

Note that most of the remaining time (~10%) is spent in a CancelloEx function, certainly (indirectly) called by CPython in order to track interruptions in the middle of the computation (so that Ctrl+C can quickly stop the computation).

When the number of items in the repeated list is smaller, a few instructions become significantly slower than the others and clearly become the bottleneck.


Overview of how modern processors work

Before trying to analyse what happens when the processor executes this code, we need to understand a few things about how modern mainstream processors work. Modern x86-64 processors are able to execute several instructions at the same time thanks to instruction-level parallelism (ILP), combined with out-of-order execution (OoO). Instructions that can be executed in parallel, independently of the others, are very cheap, while the ones on a dependency chain are much more expensive, especially when they interact with the memory hierarchy (due to the relatively high latency of memory operations). In fact, this is a bit more complex, since instructions are not executed directly: they are split into one or more micro-operations (uops). Some instructions can be macro-fused, so the processor reads them as a single big instruction. Some micro-operations can be fused too. The uops are scheduled on multiple execution units (ports) that can compute them in parallel. Some ports are specialized to execute only certain kinds of instructions (e.g. only branches, or only memory reads). Conditional branches with predictable behaviour can be predicted by the processor, so a jump is nearly free in that case, especially when it jumps to a nearby location (and the target instructions are already in the instruction cache).
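
As a hypothetical illustration of the dependency-chain point above (my own sketch, unrelated to CPython): a floating-point sum with a single accumulator forms one long dependency chain, since each addition has to wait for the previous one, while splitting the sum into four independent accumulators lets the additions overlap. Compiled with `gcc -O2` (no `-ffast-math`, so the compiler keeps the evaluation order), the second loop is typically several times faster on such processors:

#include <stdio.h>
#include <time.h>

#define N 100000000L

int main(void)
{
    clock_t t0 = clock();
    double s = 0.0;
    for (long i = 0; i < N; i++)
        s += 1e-9 * i;                      /* one long dependency chain */
    clock_t t1 = clock();

    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i < N; i += 4) {       /* four independent chains */
        s0 += 1e-9 * i;
        s1 += 1e-9 * (i + 1);
        s2 += 1e-9 * (i + 2);
        s3 += 1e-9 * (i + 3);
    }
    clock_t t2 = clock();

    printf("1 accumulator : %.3f s (sum=%.1f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, s);
    printf("4 accumulators: %.3f s (sum=%.1f)\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, s0 + s1 + s2 + s3);
    return 0;
}

The amount of work is the same in both loops; only the shape of the dependency chain changes.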


Low-level profiling

We can use precise hardware performance counters combined with a sampling-based profiler in order to find out which instruction is actually the bottleneck in each case (VTune on Windows and perf on Linux are able to do that).

Here is the result in the first case (fast):

[VTune screenshots: per-instruction timings of the two hot loops in the fast case]

Here is the result in the last one (slow):

[VTune screenshots: per-instruction timings of the two hot loops in the slow case]

In these screenshots, I have marked in blue the instructions that are the slowest in the two loops (by a wide margin). The cost of each instruction is shown on the right. The absolute timings are not really important: what mainly matters is the time taken by an instruction relative to the others.


Analysis of the low-level profiling results

In the slowest function, we can see that two instructions are the bottleneck, especially in the last benchmark: `inc rbx` and `mov qword ptr [rax-0x8], rcx`. `inc rbx` is generally cheap (1 cycle) but it is on the critical path of the dependency chain. Thus, it can limit the speed of the loop depending on how the uops are scheduled on the available ports. That being said, there is no reason for this instruction to be a bigger issue in the last benchmark. In fact, the relative time taken by this instruction is smaller in the last benchmark, since that benchmark is slower overall. However, the second instruction gets relatively much slower there: it is the real bottleneck of this function.

One should remember that processors execute instructions in parallel within a (very large) out-of-order window. Thus, in practice, an instruction can be reported as the one the processor stalls on while it is actually bound by a previous dependent instruction. As a result, `inc qword ptr [rcx]` can be responsible for the other one being slow. In fact, based on the above reverse-engineered C code, we can deduce that `rcx` is a pointer to one of the objects coming from the initial repeated list. The thing is that, on modern x86-64 processors (at least Intel ones), incrementing the same variable in the same cache line 4 times is slower than incrementing 4 different variables, especially if they are located on 4 different cache lines. This is because the L1 cache typically has a latency of 3-4 cycles, and that latency has to be paid to read the value, then to write it, then to read it again, and so on.

Note though that modern x86-64 processors are able to read a variable faster from the cache if it has been written recently. Indeed, the processor can detect the dependency and forward the freshly stored value to the subsequent load, rather than reloading the data from the cache, in order to reduce the latency of the write + read. This operation, called store-forwarding, is not free though. AFAIK, the latency of such an operation is typically 4-5 cycles on recent Intel processors, so it only mitigates the overhead, and the latency is still large enough to be the main bottleneck (for this function). For more information, please read this post about caches and store-forwarding, or even this great article about store-forwarding.
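
To make the store-forwarding effect more concrete, here is a small hypothetical C micro-benchmark (my own sketch, not the profiled CPython code). It performs a read-modify-write through a pointer on every iteration, just like the `inc qword ptr [rcx]` refcount update above, with the pointers targeting `k` distinct counters — `k` playing the role of the number of distinct objects in the list:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     1000      /* length of the simulated "list"  */
#define ITERS 200000    /* number of simulated "copies"    */

int main(int argc, char **argv)
{
    int k = (argc > 1) ? atoi(argv[1]) : 1;   /* number of distinct "objects"  */
    if (k < 1 || k > N) k = 1;

    static uint64_t counters[N];              /* the "reference counts"        */
    static uint64_t *items[N];                /* the "list" of object pointers */
    for (int i = 0; i < N; i++)
        items[i] = &counters[i % k];

    clock_t t0 = clock();
    for (int it = 0; it < ITERS; it++)
        for (int i = 0; i < N; i++)
            ++*items[i];                      /* INCREF-like read-modify-write */
    clock_t t1 = clock();

    uint64_t sum = 0;                         /* keep the stores observable    */
    for (int i = 0; i < N; i++)
        sum += counters[i];

    printf("k=%d: %.3f s (checksum %llu)\n", k,
           (double)(t1 - t0) / CLOCKS_PER_SEC, (unsigned long long)sum);
    return 0;
}

With k=1, every increment has to wait for the store-forwarding of the previous one; with k=4 or more, the increments target different addresses and can overlap. On my understanding, something like `gcc -O2 bench.c -o bench && ./bench 1` versus `./bench 4` should show the single-counter case being noticeably slower, mirroring the Python results (exact numbers will of course vary between processors).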

Regarding the second function, we can use the same logic and conclude that `jz 0x180058764 <Block 26>` is the main bottleneck, certainly because of `sub qword ptr [rbx], 0x1`, which is expensive for the same reason: the amount of parallelism available to the store-forwarding operations limits the speed of the loop. When there is only one object in the initial list, the critical path of the dependency chain of the instructions to execute is much longer.

Note that a 5-cycle loop iteration, executed twice per item (once in each hot loop), results in a 2*5*1000/4.5e9 = 2.22 µs time. This is consistent with the experimental results of the last benchmark (one object). Note that the other instructions can mostly be executed in parallel (i.e. we cannot simply add timings). With two objects, the predicted time is 1.11 µs, and the processor can better overlap the latency with many instructions (though not perfectly). With 4 objects or more, the store-forwarding latency should be mostly overlapped with other instructions. This explains why the low-level profiling results show more instructions as the ones on which the processor stalls. The stores are still relatively expensive, since my processor can only execute 1 store per cycle, not to mention that the processor tracks the dependencies with all other reads in the loop (and uses speculation for this to be fast).

Jérôme Richard
  • 1
    The two hot loops are [this one](https://github.com/python/cpython/blob/v3.8.10/Objects/listobject.c#L919-L923) in `list_extend`, to copy data into the new list and update the refcounts, and [this one](https://github.com/python/cpython/blob/v3.8.10/Objects/listobject.c#L371-L373) in `list_dealloc`, to decref the element references before freeing the list. – user2357112 Mar 09 '23 at 02:29
  • Your Python times show a speed ratio of about 2. Your assembly times show a much smaller speed ratio. Looking at those big times: `(93+434+91+276) / (151+315+158+144)` ≈ 1.16. What's the explanation? – Kelly Bundy Mar 09 '23 at 12:24
  • I think I got most of what you wrote: it's better to have a few objects than a single one. Why though is `x = [0, 0, 0, 1] * 250" "[*x]"` slower than `x = [0, 0, 0, 0] * 250" "[*x]"`? – dankal444 Mar 09 '23 at 13:38
  • @dankal444 Is it? For me it's *faster* (3.0 μs vs 4.0 μs). – Kelly Bundy Mar 09 '23 at 13:46
  • @KellyBundy yes, `[0, 1, 2, 3]` faster than `[0, 1, 0, 1]` faster than `[0, 0, 0, 0]` faster than `[0, 0, 0, 1]` (with 1.33μs, 1.63μs, 1.70μs, 2.06μs respectively). I have a relatively new computer - I'm on Windows 11, i7-11800H. – dankal444 Mar 09 '23 at 15:48
  • 1
    @KellyBundy This is not the time of the whole function but the time of a slice during approximately 1 seconde. Since the last benchmark use-case is slower, this is why I specifically state in the answer "*The absolute timings are not really important : what mainly matters is the time taken by an instruction relative to the others.*". VTune chose to provide timings but I find this quite miss-leading in many cases. I prefer Perf which provide a percentage or the number of samples. That being said, Perf failed to provide any useful information here (the assembly code was missing). – Jérôme Richard Mar 09 '23 at 22:25
  • 1
    @KellyBundy You can interpret the value not as being milliseconds but as a permille value. The timing breakdown can certainly be approximated by premultiplying the value by the provided benchmark timing in the answer. – Jérôme Richard Mar 09 '23 at 22:28
  • @dankal444 I got the exact same timings for `[0, 1, 0, 1] * 250` as for `[0, 1] * 500`, with a less than 1% difference (not significant). The same thing for `[0, 0, 0, 0] * 250` and `[0] * 1000`, with a 1% difference (still not significant since the std-dev is about 3%). As for `[0, 0, 0, 1] * 250`, the timing is between the two others (~1.3× faster than `[0, 0, 0, 0] * 250`). This is perfectly expected. Results can change from one kind of processor to another as vendors can implement new optimizations (and some of them are not published). Still, your kind of processor is close to mine. – Jérôme Richard Mar 09 '23 at 22:40
  • 1
    @dankal444 Ha, in fact, it is not so close. I though Tiger-lake was a predecessor of Ice-lake and contemporary to the Skylake/Coffee-lake/Cannon-lake/etc-lake that are very close to each other (too many of those lake). Tiger-lake uses Willow Cove cores which are an improved version of Sunny Cove cores (with minor changes besides bigger L2/L3 caches and a different associativity as well as a LPDDR5 support). Sunny Cove cores are pretty different from my Coffe-lake cores (which are basically the same as Skylake ones : Palm Cove one). – Jérôme Richard Mar 09 '23 at 22:57
  • 1
    @dankal444 Sunny Cove have a completely different memory sub-system and backend/frondend. I heard that there is a data cache between registers and the L1 cache that is meant to speed up some access (so to reduce the latency of access to the same few cache lines). AFAIK, this cache is not published. I am sceptical about whether this cache actually exists but it could explain your results. Besides, did you care about frequency scaling and turbo boost? A i7-11800H should be pretty sensitive to that (as I guess this is a notebook processor). – Jérôme Richard Mar 09 '23 at 23:12
  • 1
    @JérômeRichard Thanks! I find it interesting, it proves that some "optimizations" should be first tested on the target machine before applied. So those possible registers (or something else) help to speed-up [0, 0, 0, 0] ahead of [0, 0, 0, 1], but not enough to improve over [0,1,0,1]. Turbo boost is enabled, but during tests CPU gets to around max frequency and my results are quite repeatable(1-3% diff). Good idea to check it though – dankal444 Mar 10 '23 at 10:33