
In the following code, I create two lists with the same values: one unsorted (s_not), the other sorted (s_yes). The values come from randint(). I then run the same loop over each list and time it.

import random
import time

for x in range(1,9):

    r = 10**x  # try different values for the upper bound in randint()
    m = int(r/2)

    print("For rand", r)

    # s_not is the unsorted list
    s_not = [random.randint(1,r) for i in range(10**7)]

    # s_yes is sorted
    s_yes = sorted(s_not)

    # time the loop over the sorted list
    start = time.time()
    for i in s_yes:
        if i > m:
            _ = 1
        else:
            _ = 1
    end = time.time()
    print("yes", end-start)

    # do the same for the unsorted list
    start = time.time()
    for i in s_not:
        if i > m:
            _ = 1
        else:
            _ = 1
    end = time.time()
    print("not", end-start)

    print()

With output:

For rand 10
yes 1.0437555313110352
not 1.1074268817901611

For rand 100
yes 1.0802974700927734
not 1.1524150371551514

For rand 1000
yes 2.5082249641418457
not 1.129960298538208

For rand 10000
yes 3.145440101623535
not 1.1366300582885742

For rand 100000
yes 3.313387393951416
not 1.1393756866455078

For rand 1000000
yes 3.3180911540985107
not 1.1336982250213623

For rand 10000000
yes 3.3231537342071533
not 1.13503098487854

For rand 100000000
yes 3.311596393585205
not 1.1345293521881104

So, as the bound in randint() increases, the loop over the sorted list gets slower. Why?

Shivam
  • *n=10^7* might be overkill. As low as *n=10^5* gives me comparable results, and only takes about 2 seconds to run. – wjandrea Nov 12 '21 at 23:25
  • It's also strange that the timing of `not` doesn't seem to be affected by `r`. – Barmar Nov 12 '21 at 23:26
  • For those attributing to cache misses: list size is the same for all `r`, but there is no difference in runtime until numbers get over 10**100 – Marat Nov 13 '21 at 01:55
  • @Marat "no difference"? The times *triple* from 10^2 to 10^4. – no comment Nov 13 '21 at 01:56
  • @nocomment my bad, by some reason I used 10**(10**r) instead. – Marat Nov 13 '21 at 02:00
  • @Marat Not sure now what you mean, are you saying we're somehow wrong? – no comment Nov 13 '21 at 02:34
  • @nocomment no hidden message, I was wrong – Marat Nov 13 '21 at 03:00
  • Essentially the same issue as [this Java question](https://stackoverflow.com/questions/35018999/why-is-processing-a-sorted-array-slower-than-an-unsorted-array-javas-arrayl/35019909). – user2357112 Nov 13 '21 at 13:31
  • Just curious: what happens if you add `import copy` and do `s_yes = copy.deepcopy(sorted(s_not))`? – Filip Milovanović Nov 13 '21 at 17:47
  • A good way to avoid this problem would be to allocate the elements in a contiguous array, and sort it in place by swapping data. – Davislor Nov 13 '21 at 22:37
  • @Davislor: That would make no difference; Python's `list` is already contiguous, and sorting is done by swapping data. `sorted` doesn't do it in place (sort of; it makes a new `list`, then sorts *that* in place), but it's largely irrelevant; `list` is storing pointers to the various objects in it, not raw data, so both `list`s are aliasing the same objects. – ShadowRanger Nov 14 '21 at 13:33
  • @ShadowRanger `sorted` doesn't sort by swapping and doesn't sort the new list in place. What sorting algorithm do you think it's using? – no comment Nov 14 '21 at 14:30
  • @nocomment: We're both right, we're just using different contexts. Yes, internally, CPython is using TimSort (modified merge sort) that doesn't sort literally in-place, nor does it swap as it goes. I'm talking about the observable behavior from the Python layer (for `list.sort` anyway, you can't tell any of this for `sorted` since it makes the new `list` internally and you can't inspect it), where the original `list` is modified in-place, and the same objects are in it (no recreating things). If Davislor was talking about true arrays (`array` module or `numpy`) I was off-base. – ShadowRanger Nov 14 '21 at 16:34
  • Of course `array` module and `numpy` stuff would make the test code not work properly (it would be producing wrapper objects when iterating like this; the timings would match, but only because they'd both be making new objects over and over). – ShadowRanger Nov 14 '21 at 16:35
  • @ShadowRanger What would make the difference is storing the *data* contiguously, and sorting that in place. You just said that only pointers to the data are currently contiguous in memory. That’s not as efficient. – Davislor Nov 14 '21 at 17:22
  • I wouldn't use `time.time()` for code timing; that's why [`timeit`](https://docs.python.org/3/library/timeit.html) was invented. – Mark Ransom Nov 15 '21 at 17:04
  • @user2357112supportsMonica: Yup, same array of pointers as in [Why is processing a sorted array \*slower\* than an unsorted array? (Java's ArrayList.indexOf)](https://stackoverflow.com/a/35019909), with that effect drowning out branch prediction effects ([Why is processing a sorted array faster than processing an unsorted array?](https://stackoverflow.com/q/11227809)), although the latter effect is barely visible for the small-integer lists, not quite drowned out by CPython interpreter overhead: – Peter Cordes Nov 16 '21 at 01:35
  • (Branch prediction effects are barely visible for the first two element ranges, 1..10 and 1..100, where sorted is slightly faster because CPython interns integers from -5..256, reusing objects instead of allocating a new object for the list element to point to. So cache misses are gone, leaving just branch misses (even though both sides of the branch do the same thing, the interpreter didn't optimize), mostly drowned out by CPython interpreter overhead. Unlike cache misses which are more costly than branch misses.) – Peter Cordes Nov 16 '21 at 01:36
  • @PeterCordes I wouldn't necessarily trust those mere two timings to show branch prediction effects at all, rather than random instability of their computer. But I repeated it multiple times on a fairly stable computer and the same thing happened every time. With range 100, five runs, alternating between sorted and unsorted, the sorted took 0.73 seconds all five times and the unsorted took 0.80 seconds all five times. And with `m = -1`, it was five times 0.67 vs five times 0.68. Now I'm curious about that consistent 0.01 seconds advantage. I tried `s_yes, s_not = s_not, s_yes` at the end of ... – no comment Nov 16 '21 at 02:29
  • ... the loop to rule out that the order inside the loop matters, but still always the sorted one was so very slightly faster. – no comment Nov 16 '21 at 02:29
  • @nocomment: If those times are in seconds, that's long enough for warm-up effects like CPU frequency to stabilize. Thanks for checking, that pretty much confirms it is some effect of branching differently. It's not exactly like the C/Java version, though, because CPython is an interpreter, and thus control dependencies in the Python code become data dependencies for the code doing the interpreting. But different instructions executed are again control dependencies. (Both sides of the if do the same `_ = 1`, but maybe the bytecode differs, e.g. one side doing a jump over the else) – Peter Cordes Nov 16 '21 at 02:48
  • @PeterCordes Well, whether it's enough to stabilize the CPU/system depends on whether the CPU/system *can* be stable at all. My laptop for example, when unplugged, is *always* pretty unstable, no matter how much I warm up. And we don't know what the OP used. – no comment Nov 16 '21 at 03:05
  • @nocomment: Yes, right, but if we assume not total incompetence in benchmarking methodology, then the fact that the not-sorted times are pretty stable rand 10k upwards is a good sign that their system was stable for at least that part. Although fair point about unplugged laptops; I normally only use a desktop. I guess you have a power management policy of preferring passive cooling, clocking down instead of spinning the fans. Yeah, that would not stabilize well. (When memory latency/BW isn't a big factor, `perf stat --all-user -e task-clock,cycles ./a.out` could be better than nothing.) – Peter Cordes Nov 16 '21 at 03:36
  • @PeterCordes Ah, yes, very stable at `rand` 100k+. But I still feel surer after my more tests. And they led to a [new question](https://stackoverflow.com/q/69992915/16759116). – no comment Nov 16 '21 at 16:40

3 Answers


Cache misses. When N int objects are allocated back-to-back, the memory reserved to hold them tends to be in a contiguous chunk. So crawling over the list in allocation order tends to access the memory holding the ints' values in sequential, contiguous, increasing order too.

Shuffle it, and the access pattern when crawling over the list is randomized too. Cache misses abound, provided there are enough different int objects that they don't all fit in cache.

At x==1 and x==2 (r==10 and r==100), CPython happens to treat such small ints as singletons, so, e.g., despite that you have 10 million elements in the list, at r==100 it contains only (at most) 100 distinct int objects. All the data for those fits in cache simultaneously.
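
That singleton behavior is easy to check directly; a minimal sketch (the interned range, -5 through 256, is a CPython implementation detail, and `int("...")` is used so the objects are created at run time rather than shared as compile-time constants):

>>> a = int("100")   # within the interned range: CPython returns the shared object
>>> b = int("100")
>>> a is b
True
>>> c = int("1000")  # outside that range: a fresh object on each call
>>> d = int("1000")
>>> c is d
False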

Beyond that, though, you get more and more distinct int objects, and hardware caches become increasingly useless when the access pattern is random.

Illustrating:

>>> from random import randint, seed
>>> seed(987987987)
>>> for x in range(1, 9):
...     r = 10 ** x
...     js = [randint(1, r) for _ in range(10_000_000)]
...     unique = set(map(id, js))
...     print(f"{r:12,} {len(unique):12,}")
...     
          10           10
         100          100
       1,000    7,440,909
      10,000    9,744,400
     100,000    9,974,838
   1,000,000    9,997,739
  10,000,000    9,999,908
 100,000,000    9,999,998
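
Sorting is not the fundamental driver here: any random-ish permutation of the same list has the same end effect (see the comments below). A minimal sketch along those lines, reusing the question's timing style (`time.time()` is kept only for consistency with the question; `timeit` would be more robust, and timings vary by machine):

import random
import time

# same objects, two visiting orders: allocation order vs. a random permutation
vals = [random.randint(1, 10**6) for _ in range(10**7)]  # allocation order
perm = vals[:]
random.shuffle(perm)                                     # random visiting order

m = 500_000
for name, lst in (("allocation order", vals), ("shuffled", perm)):
    start = time.time()
    for v in lst:
        if v > m:
            _ = 1
    print(name, time.time() - start)
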
Tim Peters
  • But is this really an issue here? The list `s_yes` is accessed in order as I see it. I believe it's the actual sorting that has to deal with the cache misses. I would expect the sorting to happen only at first access (when the `for` loop starts), at least that's how I would implement it. – arne Nov 15 '21 at 08:03
  • @arne That would be true if the loop were empty. However, the loop uses the value of the elements in a condition; this requires the interpreter to fetch the actual values, and this is where the cache misses happen. The elements are allocated in sequence, but their values are random. So when you sort the list you end up with elements pointing to random memory locations, and the cache loaded when fetching element `0` is useless when fetching element `1`. In the unsorted case the cache loaded at `0` is good for `1`, `2` etc – GACy20 Nov 15 '21 at 09:58
  • @GACy20 If the loop body was empty (i.e., `pass`), you'd see the same effect. And it would be even *bigger* (ratio larger than 3), as it wouldn't be diluted by those extra statements. – no comment Nov 15 '21 at 15:12
  • @arne `sorted` sorts immediately, i.e. before the loop and the timing start. You seem to be a C++ person. Python ints are objects, and the list only stores their addresses. Think not `vector<int>` but `vector<int*>`. Reading the pointers in order is cache-friendly, but the cache-friendliness of the ints they point to depends on where the ints are in memory. – no comment Nov 15 '21 at 15:24
  • @nocomment That is one of the differences between C++ and python I always forget. – arne Nov 15 '21 at 15:59
  • @arne remembering that difference is critical to deeply understanding how Python works. – Mark Ransom Nov 15 '21 at 17:12
  • So, if the OP created the numbers sequentially and then used `shuffle()`, the effect would be exactly the other way round. Is my understanding correct? – Thomas Weller Nov 15 '21 at 18:46
  • @ThomasWeller, the effect in the end would be the same: the original list would be accessed in allocation order, and after shuffling it would be accessed in "random" order of memory blocks. – Tim Peters Nov 15 '21 at 18:50
  • I'll add that "sorted" here isn't really the fundamental driver: _any_ way of effecting a random-ish permutation would have the same end effect. In the original, because the values were themselves random-ish, sorting effected a random-ish permutation. – Tim Peters Nov 15 '21 at 22:55
  • @MarkRansom I tend to disagree on that point. It's definitely a point to keep in mind when it comes to performance, but for the day-to-day I just don't care about memory representations. – arne Nov 16 '21 at 08:44
  • @arne my comment was about more than just performance. I find it useful to know how things work under the covers. You may certainly disagree and find Python perfectly usable without that knowledge, but I think you're missing out. – Mark Ransom Nov 16 '21 at 16:43
  • Iterating in order of allocation will also reduce page (TLB) misses. – StackedCrooked Nov 18 '21 at 19:58

As the others said: cache misses, not the values or sortedness. Iterating the same sorted values, but with freshly, sequentially created objects, is fast again (actually even a bit faster than the `not` case):

s_new = [--x for x in s_yes]  # --x is -(-x): same value, but a freshly allocated int object

Just picking one size:

For rand 1000000
yes 3.6270992755889893
not 1.198620080947876
new 1.02010178565979

Looking at address differences from one element to the next (just 10**6 elements) shows that especially for s_new, the elements are nicely sequentially arranged in memory (99.2% of the time the next element came 32 bytes later), while for s_yes they're totally not (just 0.01% came 32 bytes later):

s_yes:
    741022 different address differences occurred. Top 5:
    Address difference 32 occurred 102 times.
    Address difference 0 occurred 90 times.
    Address difference 64 occurred 37 times.
    Address difference 96 occurred 17 times.
    Address difference 128 occurred 9 times.

s_not:
    1048 different address differences occurred. Top 5:
    Address difference 32 occurred 906649 times.
    Address difference 96 occurred 8931 times.
    Address difference 64 occurred 1845 times.
    Address difference -32 occurred 1816 times.
    Address difference -64 occurred 1812 times.

s_new:
    19 different address differences occurred. Top 5:
    Address difference 32 occurred 991911 times.
    Address difference 96 occurred 7825 times.
    Address difference -524192 occurred 117 times.
    Address difference 0 occurred 90 times.
    Address difference 64 occurred 37 times.

Code for that:

from collections import Counter

for s in 's_yes', 's_not', 's_new':
    print(s + ':')
    # id() in CPython is the object's memory address; take them in list order
    ids = list(map(id, eval(s)))
    # address difference between each element and the next
    ctr = Counter(j - i for i, j in zip(ids, ids[1:]))
    print('   ', len(ctr), 'different address differences occurred. Top 5:')
    for delta, count in ctr.most_common(5):
        print(f'    Address difference {delta} occurred {count} times.')
    print()
no comment
  • A better illustration for why this is about locality would be to make `s_yes = list(range(10**7))`, and `s_not = s_yes[:]`, `random.shuffle(s_not)`. Now the sorted array is also contiguously allocated, while the unsorted array is non-contiguous, so the timings should reverse. – ShadowRanger Nov 14 '21 at 13:38
  • @ShadowRanger Hmmm, I don't think that's "better". About equally good maybe. But it's further away from their data. – no comment Nov 14 '21 at 14:27

The answer is likely locality of data. Integers above a certain size limit are allocated dynamically. When you create the list, the integer objects are allocated from (mostly) nearby memory. So when you loop through the list, things tend to be in cache and the hardware prefetcher can put them there.

In the sorted case, the pointers get rearranged, so the objects are visited in a different order than they were allocated in, resulting in more cache misses.
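
One way to see this is to reverse the setup, as Alain T.'s comment below describes: allocate the ints in sorted order and shuffle a copy, so that the *sorted* list now matches allocation order. A minimal sketch (names mirror the question's; the +257 offset just keeps the values out of CPython's small-int cache):

import random
import time

s_yes = [i + 257 for i in range(10**7)]  # allocated in sorted order
s_not = s_yes[:]
random.shuffle(s_not)                    # same objects, random visiting order

m = 5 * 10**6
for name, lst in (("yes", s_yes), ("not", s_not)):
    start = time.time()
    for i in lst:
        if i > m:
            _ = 1
        else:
            _ = 1
    print(name, time.time() - start)

With this setup the question's timings should reverse: now "not" (the shuffled list) is the slow one.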

Homer512
  • Most definitely. Accessing data randomly after allocating memory in a given order is slower than accessing it sequentially in the original order. Doing this in reverse shows that the sort has nothing to do with it (except by changing the access order); i.e., starting with s_yes as a range and s_not as a shuffled copy of s_yes results in longer times for s_not. – Alain T. Nov 13 '21 at 01:07
  • That "a certain size" is **256**. – user17242583 Nov 15 '21 at 21:04
  • @user17242583 subject to change, of course. I think it's already changed once in the past. – Mark Ransom Nov 16 '21 at 16:42