
Copying a shuffled range(10**6) list ten times takes me about 0.18 seconds (these are five runs):

0.175597017661
0.173731403198
0.178601711594
0.180330912952
0.180811964451

Copying the unshuffled list ten times takes me about 0.05 seconds:

0.058402235973
0.0505464636856
0.0509734306934
0.0526022752744
0.0513324916184

Here's my testing code:

from timeit import timeit
import random

a = range(10**6)
random.shuffle(a)    # Remove this for the second test.
a = list(a)          # Just an attempt to "normalize" the list.
for _ in range(5):
    print timeit(lambda: list(a), number=10)

I also tried copying with a[:]; the results were similar (i.e., a big speed difference).

Why the big speed difference? I know and understand the speed difference in the famous "Why is it faster to process a sorted array than an unsorted array?" example, but here my processing has no decisions. It's just blindly copying the references inside the list, no?

I'm using Python 2.7.12 on Windows 10.

Edit: Tried Python 3.5.2 as well now, the results were almost the same (shuffled consistently around 0.17 seconds, unshuffled consistently around 0.05 seconds). Here's the code for that:

a = list(range(10**6))
random.shuffle(a)
a = list(a)
for _ in range(5):
    print(timeit(lambda: list(a), number=10))
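
Since the comments below raise the question of run-order effects, here is a small combined sketch (not part of the original question) that times both cases within a single run; the names `unshuffled` and `shuffled` are just for illustration:

from timeit import timeit
import random

unshuffled = list(range(10**6))
shuffled = list(range(10**6))
random.shuffle(shuffled)

# Time ten copies, five times each, for both lists in the same process.
for name, lst in [('unshuffled', unshuffled), ('shuffled', shuffled)]:
    print(name, [round(timeit(lambda: list(lst), number=10), 3) for _ in range(5)])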
Stefan Pochmann
  • In order to eliminate "external impact" (such as Python interpreter's internal state, caching heuristics of the underlying HW architecture, etc) - try to swap these two tests (i.e., change the order in which you execute them), and see if your measurements are consistent. – barak manos Feb 08 '17 at 07:52
  • I tried it myself, and their order does seem to affect the measurements. So I would guess that it has something to do with the Python interpreter's internal state. – barak manos Feb 08 '17 at 07:54
  • @barakmanos They were separate runs of the script. Also, I already do five runs each to try to eliminate other impacts. And yes, I do get this consistently. – Stefan Pochmann Feb 08 '17 at 07:55
  • Same as barak manos, when I tried running `timeit` on IPython I can't see a difference either – Greg Feb 08 '17 at 07:56
  • [Why sorting an array makes a Python loop faster](https://rickystewart.wordpress.com/2013/09/03/why-sorting-an-array-makes-a-python-loop-faster/) – vaultah Feb 08 '17 at 07:58
  • Well, once I change the order of these two tests, I am no longer able to reproduce your measurements. When I change the order and run the tests, I get identical measurements for both of them (i.e., different than what you have observed). And then, even if I change the order back to the original settings, I still get identical measurements. – barak manos Feb 08 '17 at 08:01
  • @barakmanos **There is no order!** As I already said, they were separate runs of the script (editing the script in between). And I can change forth and back and it consistently alternates between slow and fast. – Stefan Pochmann Feb 08 '17 at 08:04
  • If anything it's faster to copy a sorted array `In [30]: %timeit -n1000 copy(a_sorted) 1000 loops, best of 3: 11.5 ms per loop In [31]: %timeit -n1000 copy(a) 1000 loops, best of 3: 40.7 ms per loop` – Greg Feb 08 '17 at 08:05
  • @Greg Your previous comment and the "If anything" sounds like you disagree with me, but your result agrees with mine. I'm confused. – Stefan Pochmann Feb 08 '17 at 08:08
  • Please don't shout at me, I was trying to help you! After changing the order, I get approximately `0.25` in each iteration of each one of the tests. So on my platform, the order does matter. – barak manos Feb 08 '17 at 08:14
  • @StefanPochmann, yes I was confusing myself :D. I mixed up shuffled and sorted. But yea, you are correct, I had not run the perf test enough times initially. – Greg Feb 08 '17 at 08:15
  • @vaultah Thanks, but I've read that now and I disagree. When I saw the code there, I immediately thought of cache hits/misses of the ints, which is also the author's conclusion. But his code **adds** the numbers, which requires looking at them. My code doesn't. Mine only needs to copy the references, not access through them. – Stefan Pochmann Feb 08 '17 at 08:21
  • There is a complete answer in the link from @vaultah (though you slightly disagree with it right now, I see). But anyway, I still think we shouldn't use Python for low-level features, and thus shouldn't worry about this. Still, the topic is interesting anyway, thank you. – Nikolay Prokopyev Feb 08 '17 at 08:21
  • @barakmanos Sorry, wasn't aware that that's considered shouting (only know about CAPS shouting). – Stefan Pochmann Feb 08 '17 at 08:22
  • @NikolayProkopyev Yeah, I'm not worried about it, just noticed this while doing something else, couldn't explain it, and got curious. And I'm glad I asked and have an answer now :-) – Stefan Pochmann Feb 08 '17 at 11:01
  • Very interesting find! You should tag this with either `cpython` or `python-internals` since it boils down to an implementation detail, we shouldn't give an impression that this applies to all Pythons and will be present in all future releases. – Dimitris Fasarakis Hilliard Feb 08 '17 at 12:49
  • @JimFasarakis-Hilliard Ok I did. Went with `python-internals` because I'm not sure it's just CPython and not sure that this will forever be like this in CPython. – Stefan Pochmann Feb 08 '17 at 13:32

4 Answers


The interesting bit is that the result depends on the order in which the integers were first created. For example, instead of shuffling, create a random sequence with random.randint:

from timeit import timeit
import random

a = [random.randint(0, 10**6) for _ in range(10**6)]
for _ in range(5):
    print(timeit(lambda: list(a), number=10))

This is as fast as copying your unshuffled list(range(10**6)) (the fast case).

However, when you shuffle, your integers are no longer in the order in which they were first created, and that's what makes the copy slow.

A quick intermezzo:

  • All Python objects live on the heap, so a list holds only pointers to its elements.
  • Copying a list is a shallow operation.
  • However, CPython uses reference counting, so when an object is put into a new container its reference count must be incremented (Py_INCREF in list_slice). That means Python really has to go to where the object lives; it can't just copy the reference.
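
A small sketch (not from the original answer) that makes this refcount effect visible from Python: copying a list bumps the count stored inside each element's object, which is why the object's memory must be touched:

import sys

x = 10**6              # a non-interned int, so it has its own refcount
a = [x] * 10
before = sys.getrefcount(x)

b = list(a)            # copy the list: ten references to x are duplicated

after = sys.getrefcount(x)
print(after - before)  # 10 -- one Py_INCREF per copied reference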

So when you copy your list, you take each item of that list and put it "as is" into the new list. When the next item was created shortly after the current one, there is a good chance (no guarantee!) that it was allocated next to it on the heap.

Let's assume that whenever your computer loads an item into the cache it also loads the x next-in-memory items (cache locality). Then your computer can perform the reference count increment for x+1 items from the same cached chunk of memory!

With the shuffled sequence it still loads the next-in-memory items, but these aren't the ones next in the list. So it can't perform the reference-count increment without actually fetching the next item from memory.

TL;DR: The actual speed depends on what happened before the copy: in what order the items were created and in what order they appear in the list.
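
To illustrate the TL;DR, a small sketch building on an idea from the comments below: re-creating the integer objects in list order should make the copy fast again, because allocation order then matches list order. The `(x - 1) + 1` expression is just one way to force CPython to allocate fresh int objects for non-interned values:

from timeit import timeit
import random

a = list(range(10**6))
random.shuffle(a)
print(timeit(lambda: list(a), number=10))   # slow: creation order != list order

# Re-create the integers in list order; the new objects are allocated
# roughly sequentially on the heap, so locality should be restored.
b = [(x - 1) + 1 for x in a]
print(timeit(lambda: list(b), number=10))   # should be close to the unshuffled timing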


You can verify this by looking at the id:

CPython implementation detail: This is the address of the object in memory.

a = list(range(10**6, 10**6+100))
for item in a:
    print(id(item))

Just to show a short excerpt:

1496489995888
1496489995920  # +32
1496489995952  # +32
1496489995984  # +32
1496489996016  # +32
1496489996048  # +32
1496489996080  # +32
1496489996112
1496489996144
1496489996176
1496489996208
1496489996240
1496507297840
1496507297872
1496507297904
1496507297936
1496507297968
1496507298000
1496507298032
1496507298064
1496507298096
1496507298128
1496507298160
1496507298192

So these objects are really "next to each other on the heap". With shuffle they aren't:

import random
a = list(range(10**6, 100+10**6))
random.shuffle(a)
last = None
for item in a:
    if last is not None:
        print('diff', id(item) - id(last))
    last = item

Which shows these are not really next to each other in memory:

diff 736
diff -64
diff -17291008
diff -128
diff 288
diff -224
diff 17292032
diff -1312
diff 1088
diff -17292384
diff 17291072
diff 608
diff -17290848
diff 17289856
diff 928
diff -672
diff 864
diff -17290816
diff -128
diff -96
diff 17291552
diff -192
diff 96
diff -17291904
diff 17291680
diff -1152
diff 896
diff -17290528
diff 17290816
diff -992
diff 448

Important note:

I didn't come up with this myself. Most of the information can be found in the blog post by Ricky Stewart.

This answer is based on the "official" CPython implementation of Python. The details in other implementations (Jython, PyPy, IronPython, ...) may be different. Thanks @JörgWMittag for pointing this out.

MSeifert
  • @augurar Copying a reference implies incrementing the reference counter which is in the object (thus the object access is inevitable) – Leon Feb 08 '17 at 08:30
  • Good additional test, thanks. But it still doesn't explain it. It's similar to the article that @vaultah linked to in the comments. And as I commented there, I'm not "loading" (your term, not sure what exactly you mean) the integers, I'm only copying the references. – Stefan Pochmann Feb 08 '17 at 08:30
  • @Leon Ok, that makes sense. Do you have a link documenting that? – Stefan Pochmann Feb 08 '17 at 08:32
  • @Leon So you're saying that any operation on a reference actually requires accessing the object itself?? That's tricky, and it seems to undermine the assumed performance benefit of working with pointers. – Nikolay Prokopyev Feb 08 '17 at 08:33
  • @StefanPochmann The function doing the copy is [`list_slice`](https://github.com/python/cpython/blob/403ccddb9598bac6e0e6db4ba5aa2fe494512a98/Objects/listobject.c#L432) and in line 453 you can see the `Py_INCREF(v);` call that needs to access the heap-allocated object. – MSeifert Feb 08 '17 at 08:38
  • @StefanPochmann The best overview that I could find about reference counting in Python is this: http://edcjones.tripod.com/refcount.html – Leon Feb 08 '17 at 08:42
  • @Leon I only read parts of it now (too detailed for me) and it's old, but I found the relevant bit (the reference count being included inside the object) [in the current code](https://github.com/python-git/python/blob/master/Include/object.h#L78). Not that I doubted it, as it immediately made perfect sense when you said it. Thanks. – Stefan Pochmann Feb 08 '17 at 09:14
  • @MSeifert Another good experiment is using `a = [0] * 10**7` (up from 10**6 because that was too unstable), which is even faster than using `a = range(10**7)` (by a factor of about 1.25). Clearly because that's even better for caching. – Stefan Pochmann Feb 08 '17 at 09:22
  • I was just wondering why I got 32-byte integers on a 64-bit computer with 64-bit Python. But actually that's good for caching as well :-) Even `[0,1,2,3]*((10**6) // 4)` is as fast as `a = [0] * 10**6`. However, with integers from 0-255 there is another factor coming in: these are interned, so with them the order of creation (inside your script) isn't important anymore - because they are created when you start Python. – MSeifert Feb 08 '17 at 09:33
  • Note that of the currently existing four production-ready Python implementations, only *one* uses reference counting. So, this analysis really only applies to a single implementation. – Jörg W Mittag Feb 08 '17 at 09:50
  • @MSeifert I don't think the interning plays a big role here, though. Interestingly, for me `[0,1,2,3]*((10**7) // 4)` is even *faster* than `[0]*(10**7)` (again I went up to 10**7 for stability). Faster by a factor of about 1.32. I'm doing more tests now... – Stefan Pochmann Feb 08 '17 at 09:51
  • @MSeifert Finished more tests now and posted them in my separate answer. Any idea why repeating `[0,1,2,3]` is significantly faster than repeating `[0]` for me but not for you? – Stefan Pochmann Feb 08 '17 at 10:52
  • Great answer, if you can, make another "important note" highlighting that this behavior is *specific* to `C`Python. – Dimitris Fasarakis Hilliard Feb 08 '17 at 12:50
  • Maybe worth noting that `deepcopy` yields the same. I tried changing `a = list(a)` to `a = deepcopy(a)`, but the problem is the same, as `deepcopy` seems to leave the numbers alone (it keeps the original objects at their memory locations). Doing something like `a = map(lambda x: (x - 1) + 1, a)` makes the timing difference disappear (as it effectively regenerates the objects in a new place). – MariusSiuram Feb 08 '17 at 15:47
  • So, TL;DR **cache misses are expensive**? This kind of thing is so cool to me - it's like in physics when they observe quantum effects on the scale of classical mechanics :D Same thing with virtual memory and page faults. Related: [cache oblivious data structures](https://blogs.msdn.microsoft.com/devdev/2007/06/12/cache-oblivious-data-structures/) – Blackhawk Feb 08 '17 at 15:48

When you shuffle the list items, they have worse locality of reference, leading to worse cache performance.

You might think that copying the list just copies the references, not the objects, so their locations on the heap shouldn't matter. However, copying still involves accessing each object in order to modify the refcount.
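
If locality really is the cause, then sorting the shuffled list back should restore the fast timing, because list order then matches creation order again. A small sketch to check that expectation (not from the original answer; exact numbers will vary by machine):

from timeit import timeit
import random

a = list(range(10**6))
random.shuffle(a)
print(timeit(lambda: list(a), number=10))   # slow: poor locality of reference

a.sort()                                    # list order matches creation order again
print(timeit(lambda: list(a), number=10))   # expected to be close to the unshuffled case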

augurar
  • This might be a better answer for *me* (at least if it had a link to "proof" like MSeifert's) as this is all I was missing and it's very succinct, but I think I'll stick with MSeifert's as I feel it might be better for others. Upvoted this as well, though, thanks. – Stefan Pochmann Feb 08 '17 at 09:31
  • Will also add that pentioids, athlums etc have mystical logic in them to detect address patterns, and will start prefetching data when they see a pattern. Which in this case, could be kicking in to prefetch the data (reducing cache misses) when the numbers are in order. This effect is in addition, of course, to the increased % of hits from locality. – greggo Feb 08 '17 at 23:01

As others have explained, copying the list doesn't just copy the references; it also increments the reference count inside each object, so the objects are accessed and the cache plays a role.

Here I just want to add more experiments. Not about shuffled vs. unshuffled (where accessing one element might miss the cache but bring the following elements into the cache so that they get hit), but about repeating elements, where later accesses to the same element might hit the cache because the element is still cached.

Testing a normal range:

>>> from timeit import timeit
>>> a = range(10**7)
>>> [timeit(lambda: list(a), number=100) for _ in range(3)]
[5.1915339142808925, 5.1436351868889645, 5.18055115701749]

A list of the same size but with just one element repeated over and over again is faster because it hits the cache all the time:

>>> a = [0] * 10**7
>>> [timeit(lambda: list(a), number=100) for _ in range(3)]
[4.125743135926939, 4.128927210087596, 4.0941229388550795]

And it doesn't seem to matter what number it is:

>>> a = [1234567] * 10**7
>>> [timeit(lambda: list(a), number=100) for _ in range(3)]
[4.124106479141709, 4.156590225249886, 4.219242600790949]

Interestingly, it gets even faster when I instead repeat the same two or four elements:

>>> a = [0, 1] * (10**7 / 2)
>>> [timeit(lambda: list(a), number=100) for _ in range(3)]
[3.130586101607932, 3.1001001764957294, 3.1318465707127814]

>>> a = [0, 1, 2, 3] * (10**7 / 4)
>>> [timeit(lambda: list(a), number=100) for _ in range(3)]
[3.096105435911994, 3.127148431279352, 3.132872673690855]

I guess something doesn't like the same single reference counter being incremented all the time. Maybe there's a pipeline stall because each increment has to wait for the result of the previous increment, but this is a wild guess.

Anyway, trying this for even larger numbers of repeated elements:

from timeit import timeit
for e in range(26):
    n = 2**e
    a = range(n) * (2**25 / n)
    times = [timeit(lambda: list(a), number=20) for _ in range(3)]
    print '%8d ' % n, '  '.join('%.3f' % t for t in times), ' => ', sum(times) / 3

The output (first column is the number of different elements, for each I test three times and then take the average):

       1  2.871  2.828  2.835  =>  2.84446732686
       2  2.144  2.097  2.157  =>  2.13275338734
       4  2.129  2.297  2.247  =>  2.22436720645
       8  2.151  2.174  2.170  =>  2.16477771575
      16  2.164  2.159  2.167  =>  2.16328197911
      32  2.102  2.117  2.154  =>  2.12437970598
      64  2.145  2.133  2.126  =>  2.13462250728
     128  2.135  2.122  2.137  =>  2.13145065221
     256  2.136  2.124  2.140  =>  2.13336283943
     512  2.140  2.188  2.179  =>  2.1688431668
    1024  2.162  2.158  2.167  =>  2.16208440826
    2048  2.207  2.176  2.213  =>  2.19829998424
    4096  2.180  2.196  2.202  =>  2.19291917834
    8192  2.173  2.215  2.188  =>  2.19207065277
   16384  2.258  2.232  2.249  =>  2.24609975704
   32768  2.262  2.251  2.274  =>  2.26239771771
   65536  2.298  2.264  2.246  =>  2.26917420394
  131072  2.285  2.266  2.313  =>  2.28767871168
  262144  2.351  2.333  2.366  =>  2.35030805124
  524288  2.932  2.816  2.834  =>  2.86047313113
 1048576  3.312  3.343  3.326  =>  3.32721167007
 2097152  3.461  3.451  3.547  =>  3.48622758473
 4194304  3.479  3.503  3.547  =>  3.50964316455
 8388608  3.733  3.496  3.532  =>  3.58716466865
16777216  3.583  3.522  3.569  =>  3.55790996695
33554432  3.550  3.556  3.512  =>  3.53952594744

So from about 2.8 seconds for a single (repeated) element it drops to about 2.2 seconds for 2, 4, 8, 16, ... different elements and stays at about 2.2 seconds until the hundred thousands. I think this uses my L2 cache (4 × 256 KB, I have an i7-6700).

Then over a few steps, the times go up to 3.5 seconds. I think this uses a mix of my L2 cache and my L3 cache (8 MB) until that's "exhausted" as well.

At the end it stays at around 3.5 seconds, I guess because my caches don't help with the repeated elements anymore.

Stefan Pochmann

Before the shuffle, objects allocated one after another on the heap sit at adjacent addresses, so when the list is traversed in index order the cache hit rate is high. After the shuffle, the objects at adjacent indices of the new list are no longer adjacent in memory, so the hit rate is very poor.

xws