
I have a list l and non-negative integers n, start0, start1.

I need the most efficient (fastest) way to copy (move) l[start0 : start0 + n] to l[start1 : start1 + n].

I need a shallow copy (only copying references) for my current case, but it would also be interesting to know how to make a fast deep copy of a range.

Is the following the most efficient and shortest way?

l[start1 : start1 + n] = l[start0 : start0 + n]

Since the left and right sides have the same length n, in theory Python could optimize this by just copying the data, without creating any intermediate lists or resizing/re-allocating the original list. But does Python actually perform such an optimization?

I've tried the following measurements. I create lists of different lengths while copying just a short sub-range, in three variants: replacing a sub-range with one of the same length, with a shorter one, and with a longer one. In theory, for the same-length case Python can just copy the short range of data, which is very efficient, while for a shorter or longer replacement Python has to move and/or re-allocate the whole large array.


# Needs: python -m pip install timerit
import math, timerit

for lsize in range(5, 27, 3):
    size = 1 << lsize
    for n in [10, 9, 11]:
        print(f'size = {str(size).rjust(8)}, n = {str(n).rjust(2)}: ', end = '', flush = True)
        for t in timerit.Timerit(num = math.ceil(10 ** 5 / math.sqrt(size)), verbose = 1):
            l = [0] * size  # the list is re-created on every run, outside the timed block
            with t:
                l[10 : 20] = l[0 : n]  # timed: replace a 10-element range with an n-element one

outputs:

size =       32, n = 10: Timed best=1.026 µs, mean=1.464 ± 0.2 µs
size =       32, n =  9: Timed best=1.026 µs, mean=1.528 ± 0.2 µs
size =       32, n = 11: Timed best=1.539 µs, mean=1.544 ± 0.1 µs
size =      256, n = 10: Timed best=1.026 µs, mean=1.532 ± 0.1 µs
size =      256, n =  9: Timed best=1.539 µs, mean=1.546 ± 0.1 µs
size =      256, n = 11: Timed best=1.539 µs, mean=2.172 ± 0.2 µs
size =     2048, n = 10: Timed best=1.539 µs, mean=1.585 ± 0.1 µs
size =     2048, n =  9: Timed best=2.565 µs, mean=2.580 ± 0.1 µs
size =     2048, n = 11: Timed best=4.105 µs, mean=4.375 ± 0.3 µs
size =    16384, n = 10: Timed best=2.052 µs, mean=2.125 ± 0.2 µs
size =    16384, n =  9: Timed best=11.802 µs, mean=12.107 ± 0.3 µs
size =    16384, n = 11: Timed best=12.315 µs, mean=12.874 ± 0.7 µs
size =   131072, n = 10: Timed best=3.592 µs, mean=3.907 ± 0.2 µs
size =   131072, n =  9: Timed best=96.988 µs, mean=100.387 ± 1.4 µs
size =   131072, n = 11: Timed best=851.852 µs, mean=873.135 ± 13.3 µs
size =  1048576, n = 10: Timed best=9.750 µs, mean=10.434 ± 0.3 µs
size =  1048576, n =  9: Timed best=1.119 ms, mean=1.127 ± 0.0 ms
size =  1048576, n = 11: Timed best=6.986 ms, mean=7.054 ± 0.0 ms
size =  8388608, n = 10: Timed best=10.263 µs, mean=10.862 ± 0.4 µs
size =  8388608, n =  9: Timed best=9.752 ms, mean=9.813 ± 0.0 ms
size =  8388608, n = 11: Timed best=58.158 ms, mean=58.692 ± 0.4 ms
size = 67108864, n = 10: Timed best=12.829 µs, mean=13.548 ± 0.4 µs
size = 67108864, n =  9: Timed best=78.876 ms, mean=79.061 ± 0.1 ms
size = 67108864, n = 11: Timed best=471.014 ms, mean=473.688 ± 1.9 ms

As we can see, for long lists the same-length (10) replacement indeed takes microseconds, while replacing with a shorter (9) or longer (11) range takes milliseconds. Also, the same-length copy for very short and very long lists takes not-very-different times (not thousands of times different, as in the 9 and 11 cases). Finally, replacing a range with a shorter range is significantly faster than replacing it with a longer range, because the former only needs to move the whole array's data, while the latter may also need to re-allocate the array, since a larger array is required to store the new data.

I did the measurements above intentionally just to check that Python is clever enough not to re-create the whole array when a range is replaced with one of the same length. A non-clever, naive implementation of list slice assignment would also be possible: 1) create a new temporary array with a copy of everything before the target range, 2) create a new temporary array with a copy of the source range l[start0 : start0 + n], 3) create a new temporary array with a copy of everything after the target range, 4) concatenate 1)-3) into a new large array (a sketch of this naive scheme is shown below). Of course I was almost sure Python doesn't do that, but I did the measurements just to re-check. And of course only n elements are copied only in the case when the slice expression on the right has the same length as the one on the left. If the left and right sides differ in size, then half (on average) or even the whole array's data has to be moved, as shown by the measurements for cases 9 and 11.
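For illustration, here is a minimal sketch of such a hypothetical naive implementation in pure Python (this is not what CPython actually does for same-length slices, as the timings above show):

def naive_slice_assign(l, start0, n, start1):
    # Rebuild the whole list from three temporaries, as a naive interpreter might
    # do for l[start1 : start1 + n] = l[start0 : start0 + n].
    head = l[:start1]                # temporary copy of everything before the target range
    body = l[start0:start0 + n]      # temporary copy of the source range
    tail = l[start1 + n:]            # temporary copy of everything after the target range
    return head + body + tail        # concatenate everything into a brand-new list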

So it seems that Python does optimize same-length ranges (meaning the whole original list is not re-allocated from scratch; only the range values (references) are copied). Is there any documentation confirming this?

Maybe there is also some even more efficient built-in function for copying same-length ranges, something like l.copy_range(start0, n, start1)? Just as C++ has std::copy(vec.begin() + start0, vec.begin() + start0 + n, vec.begin() + start1), Python might have a built-in C-based method for such copying of references.
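As far as I know no such list method exists; the closest pure-Python equivalent would be a trivial wrapper around slice assignment (the name copy_range below is hypothetical, taken from this question):

def copy_range(l, start0, n, start1):
    # Copy the references of l[start0 : start0 + n] onto l[start1 : start1 + n] in place.
    l[start1:start1 + n] = l[start0:start0 + n]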

Also, what about a step? I.e. for copying [start0 : start0 + n : step].
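A stepped copy can be written with extended-slice assignment; here is a minimal sketch (note that for a step other than 1, CPython requires both sides to select exactly the same number of elements):

l = list(range(20))
start0, start1, n, step = 0, 1, 5, 2
# Both extended slices select exactly n elements; the right-hand side is
# materialized as a temporary list before being written into the target slice.
l[start1:start1 + n * step:step] = l[start0:start0 + n * step:step]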

The following is also a way to copy a range:

for i in range(n):
    l[start1 + i] = l[start0 + i]

but it is probably not as efficient as the first method.

So what's the most efficient way to copy a range from one place in a list to another place in the same list?

PS. BTW, for my specific task I don't even need to copy the range but to move it from one place to another, i.e. the old region can be None-filled. If a built-in function for moving a region existed in Python, it would be enough for it to use a C function like memmove() and then zero the old region; there would be no need to copy references or to increment/decrement the ref-counts of the entries.
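A minimal pure-Python sketch of such a move (the helper name is hypothetical; this still goes through ordinary slice assignment, so reference counts are updated, unlike a hypothetical memmove-based built-in):

def move_range(l, start0, n, start1):
    # Move l[start0 : start0 + n] to l[start1 : start1 + n] and None-fill the old region.
    # Assumes the two ranges do not overlap.
    l[start1:start1 + n] = l[start0:start0 + n]  # shallow copy of the references
    l[start0:start0 + n] = [None] * n            # "zero out" the vacated region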

Arty
  • I find it odd that the timings for `n = 10` increase from `1.026 µs` to `12.316 µs` and I wouldn't call that "not very different times". If I move `l = [0] * size` to before `for t in t1:`, then they stay about constant. – superb rain Oct 08 '20 at 10:47
  • I think your n=11 timings are misleading, as in practice you probably wouldn't make it re-allocate *every single time*. – superb rain Oct 08 '20 at 11:04
  • @superbrain I chose n = 10, 9, 11 specifically to show the case when just the range entries are copied (10), when moving the whole array is needed (9), and when moving the whole array plus possibly a re-allocation is needed (11). I did these measurements intentionally just to check that case 10 in fact doesn't recreate the whole array, because a naive implementation of Python could handle this case like this: 1) create a temporary array consisting of [0:start0], 2) create a temporary array of [start0 + n0:], 3) create a temporary array of [start1:start1 + n1], then create a new array as the concatenation of 1)-3). Such a non-optimized implementation could exist. – Arty Oct 08 '20 at 12:13
  • @superbrain I call case 10 `not-very-different` meaning that it is not as drastically different (like thousands of times) as in the 9 and 11 cases. It could differ for small arrays maybe just due to cache usage - look at the 1M and 8M sizes: there the cache is probably not used, and case 10 has almost the same timing, 8M even a bit faster. Also, moving the creation of `l` out of the loop is only possible for case 10, but I expected the timing module `timerit` to skip that time because of the `with` block; according to the docs, everything before the `with` is not counted in the timing. Maybe more than 10 entries need to be copied for timerit to be precise. – Arty Oct 08 '20 at 12:18
  • @superbrain Also you can see that for all sizes <= 2048 the times for case 10 (and even for cases 9 and 11) are almost the same, and beyond those sizes they start to grow. This most likely happens because the Level-1 cache is hit for small sizes. – Arty Oct 08 '20 at 12:34

1 Answer


TL;DR: The CPython interpreter creates a shallow copy of the right-hand slice of l and then shallow-copies it into the target range.

Lists are basically arrays of object references. A shallow copy of a list is a memory array containing only object references (i.e. pointers) to objects shared with other lists (e.g. at least the copied one just after the slice has been created). A deep copy of a list is a memory array containing object references to newly created copies of the objects in the source list. You can find more information about Python shallow/deep copies in the documentation of the copy module.
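As a small illustration of the difference (a minimal sketch, not part of the original answer):

import copy

inner = [1, 2]
l = [inner, inner]

shallow = l[:]            # new outer list, but it references the same inner object
deep = copy.deepcopy(l)   # new outer list and new copies of the inner objects

inner.append(3)
print(shallow[0])  # [1, 2, 3] -> the shallow copy still shares `inner`
print(deep[0])     # [1, 2]    -> the deep copy does not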

This means that when you write l[0:c], a new array is created in memory referencing the first c objects of the list l. You can see that by building huge lists and looking at memory usage and execution time:

# CPython process memory: < 50 MB
a = [0 for i in range(256*1024*1024)]  # time: 5.45 s
# CPython process memory: ~2 GB
# Note that all references refer to the same immutable object 0, and each
# reference (memory pointer) takes 8 bytes on my 64-bit machine.
tmp = a[0:128*1024*1024]  # time: 372 ms
# CPython process memory: ~3 GB
# tmp takes 1 GB here because of the 128*1024*1024 8-byte references.
del tmp  # time: 219 ms
# CPython process memory: ~2 GB
# The internal reference array of tmp has been freed.
a[0:128*1024*1024] = a[128*1024*1024:256*1024*1024]  # time: 1.30 s
# CPython process memory: ~2 GB
# CPython allocates 1 GB of additional memory for a short time and frees it
# after the shallow copy has been made.

Playing with many distinct immutable objects is also interesting (note that the reported memory usage is not very precise here):

# CPython process memory: < 50 MB
%time a = [i for i in range(64*1024*1024)]
# CPython process memory: ~2.6 GB
# In addition to the references, a lot of integer objects have to be created
# (each much heavier than a reference).
%time tmp = a[0:32*1024*1024]
# CPython process memory: ~2.8 GB
# No integer objects are copied, but references are.
%time del tmp
# CPython process memory: ~2.6 GB
# The references have been freed.
%time a[0:32*1024*1024] = a[32*1024*1024:64*1024*1024]
# CPython process memory: ~1.5 GB
# Only the object references are copied, not the integer objects.
# Half of the integer objects have been freed because they are not referenced
# anymore; each remaining integer object is now referenced twice in the array.

Thus, no specific optimization is performed by CPython in your case: the interpreter creates a shallow copy of the right-hand slice and then shallow-copies it into the target range (and resizes the target list if the two slices differ in size).

AFAIK there is no way to create a Python list slice without a copy. One can use generators/loops to avoid the copy, but this is slow. I am not aware of a faster way to perform the list copy. One solution is to move to Numpy, as it has better memory views (more compact in memory):

import numpy as np

a = np.ones(256*1024*1024, dtype=np.int32)  # time: 515 ms
# CPython process memory: ~1 GB
# No references are created, just a raw memory array of 32-bit integers.
tmp = a[0:128*1024*1024]  # time: 24 ms
# CPython process memory: ~1 GB
# No shallow copy is created, just a (lightweight) view object.
del tmp  # time: 21 ms
# CPython process memory: ~1 GB
a[0:128*1024*1024] = a[128*1024*1024:256*1024*1024]  # time: 94 ms
# CPython process memory: ~1 GB
# The integers themselves are copied (as there are no references).

Note that Numpy arrays can also contain Python objects (with dtype=object), although this is not very efficient; it should still be faster than a CPython list if the slices are big enough. Additionally, note that you can perform a deep copy of a list or of any Python object with copy.deepcopy(...).
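For a deep copy of just a sub-range, a minimal sketch (the helper name is hypothetical) would be to combine slice assignment with copy.deepcopy:

import copy

def deep_copy_range(l, start0, n, start1):
    # Deep-copy the objects of l[start0 : start0 + n] into l[start1 : start1 + n].
    l[start1:start1 + n] = copy.deepcopy(l[start0:start0 + n])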

Jérôme Richard
  • But what about the original question, regarding how to copy a list's sub-range from one place in a list to another place in the same list? What is the fastest way to make both a shallow and a deep copy of the objects in this sub-range? – Arty Oct 08 '20 at 08:30
  • BTW, just for my current task it is quite enough to have a shallow copy, just copying the references, but it would also be interesting to know how to make a deep copy. I updated my question to reflect that I need a shallow copy for my task, but a deep copy is also interesting to solve. – Arty Oct 08 '20 at 08:38
  • Also I don't need to copy a slice reference - I need to copy a whole range of pointers (references to elements) from one place in the list to another place in the same list. The same behavior is needed as in the very last small piece of code in my question, which does it through a regular loop. – Arty Oct 08 '20 at 08:54
  • I'm looking for fast built-in methods (if they exist) like `lst.shallow_copy_range(start0, n, start1)` or `lst.deep_copy_range(start0, n, start1)`. Or if my method described in my question is the fastest then my method is enough. – Arty Oct 08 '20 at 08:58
  • Just like in `C++` there is `std::copy(vec.begin() + 100, vec.begin() + 200, vec.begin() + 1000)` to copy a range. The same kind of optimized copy of references could theoretically be present somewhere in Python. – Arty Oct 08 '20 at 09:08
  • I edited the question to add more information regarding your comments. I do not think there is a faster method for copying object references from one Python list to another (AFAIK the temporary list creation due to list slicing cannot be avoided in plain Python). The analogous operation to what you want in C++ is what Numpy should already do with its views: the Numpy view creation time is cheap and the copy is similar to your C++ line (except compilers can replace it with a fast `memcpy`/`memmove` call while Numpy will probably not do that). – Jérôme Richard Oct 08 '20 at 09:16
  • Thanks! Also, in fact, for my current specific task I actually need to move a range of entries from one place in the list to another within the same list. So not only is a shallow copy enough, it is not even needed (internally) to copy the references (with ref-count increments); it would be enough for Python to internally do a `memmove(...)` and zero the old memory region, if such a built-in function existed. – Arty Oct 08 '20 at 09:23
  • Actually, due to reference counting in CPython, Numpy and CPython cannot perform a simple `memmove(...)`: the counters of the copied references must be incremented and the ones of the overwritten references must be decremented. This operation is quite expensive. AFAIK, the only way to overcome this problem is to use another interpreter like *PyPy* which uses a *tracing GC* rather than automatic reference counting or *not to use Python object* at all (eg. using basic Numpy types). Note that you can mix Numpy with Numba to efficiently copy small ranges (Numpy call are slower than built-in). – Jérôme Richard Oct 08 '20 at 09:39
  • By using `memmove` I actually meant that the ref-counts of the range that is copied are not changed, hence memmove can be used. But of course the range at the destination needs its references decremented. So half of the work can be done with memmove. – Arty Oct 08 '20 at 09:42
  • BTW, if I use numpy with plain types (ints) and do something like `a[10:100] = a[210:300]`, is it achieved by `memmove`/`memcpy`, or does numpy do it all through iteration in Cython loops? – Arty Oct 08 '20 at 09:43
  • I think neither Numpy nor CPython can know that some counters may not need to be updated without a rather expensive analysis. You could write your own C/C++ CPython module to do that more efficiently. Numpy array assignment is performed by a generic native C loop. It is as fast as `memmove` on my machine (it may not be the case on all platforms). Copying Python objects in a Numpy array is about 4 times slower on my machine, mainly due to [CPython reference counting function calls](https://docs.python.org/3/c-api/refcounting.html). – Jérôme Richard Oct 08 '20 at 10:28