
I'm studying Python and trying to understand how containers work in practice. There is an issue I can't explain. Suppose I create a very large list:

>>> l = [i for i in range(100000000)] # ~3 sec

It takes ~3 seconds to create it (I use ascending numbers instead of the same value to avoid possible optimizations).

As we can read here, the delete operation costs O(n). But when I remove an element from the middle of the list, it returns instantly (just as fast as any other simple command, like element access):

>>> del l[50000000] # instantly (< 0.1 sec)

and if I access elements l[25000000] and l[75000000] within 3 seconds of the deletion, those accesses also execute instantly (so I can't explain this by delayed or background removal).
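
For reference, these timings can be reproduced with something like the following sketch (the `timed` helper is just for illustration; like the example above, it needs several GB of RAM):

import time

# Crude wall-clock measurement; precise enough for these magnitudes.
def timed(label, func):
    start = time.perf_counter()
    result = func()
    print("%-8s %.3f s" % (label, time.perf_counter() - start))
    return result

l = timed("create", lambda: [i for i in range(100000000)])   # ~ seconds
timed("delete", lambda: l.__delitem__(50000000))             # ~ tens of ms
timed("access", lambda: (l[25000000], l[75000000]))          # ~ microseconds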

Can someone please explain how this is done internally? Is the list in fact implemented as some kind of tree? That sounds weird, and it would also violate the requirement of constant-time element access.

Is this a common optimization, like return value optimization in C++, or something rare, specific to my platform/version?

I use Linux and Python 3.4.1 (Python 2.7.9 shows the same results).

Vasily
  • 1) That's not what "time complexity" means. 2) Show the complete piece of code you're benchmarking. 3) Show how you do it. – BartoszKP Sep 03 '15 at 11:22
  • *"Is the list in fact implemented as some kind of tree?"* - no, and there's no *"background removal"* going on either. Python is open source, you can see the implementation: https://hg.python.org/cpython/file/tip/Objects/listobject.c – jonrsharpe Sep 03 '15 at 11:24
  • I'm using the Python interpreter, not a separate program. I've provided all the code I typed. By "time complexity" I mean the asymptotic complexity of the operation, not the time required for its execution. – Vasily Sep 03 '15 at 11:25
  • "Go look at source code", are you serious? – Vasily Sep 03 '15 at 11:26
  • @Vasily yes, completely. You're *"trying to understand how containers work"*, and that's one way to do it. Also note that you need to include `@username` in your comments to inform other users that you're addressing them. – jonrsharpe Sep 03 '15 at 11:26
  • The listobject.c source is quite easy to read, assuming you have some familiarity with C syntax. – PM 2Ring Sep 03 '15 at 11:27
  • @jonrsharpe OK, we just see the purpose of this site differently. No offence or disrespect intended. – Vasily Sep 03 '15 at 11:39
  • The `delete item` operation in the link is the `list_var.remove(x)` operation, not `del list_var`. – Ashwani Agarwal Sep 03 '15 at 11:53
  • @Ashwani Agarwal And I'm not implying it is. I'm saying that if you delete a single element from the middle of a very large array and want to keep constant-time access, it will cost you O(n). – Vasily Sep 03 '15 at 12:00
  • Instead of removing the 50000001st element, try removing the first. Try `l = list(range(100000000))` followed by `for i in range(50): del l[0]` – Andrea Corbellini Sep 03 '15 at 12:06
  • @Andrea Corbellini OK, I see. Removing the first 50 elements costs a couple of seconds (removing just the first element is almost instant too). But still, why is it so fast? Each of these 50 deletions should be O(n), right? So why can initializing an element with a counter value cost 100 times more than copying the value from the next element? – Vasily Sep 03 '15 at 12:13
  • @Vasily Nothing happens "instantly" on a computer, and 0.1 seconds is certainly not instant (most people would consider it pretty damn slow for a single list operation, in fact). Also, the time complexity of an operation *doesn't* tell you how long you'll have to wait. – Will Vousden Sep 03 '15 at 12:28
  • Firstly, it's not "fast": it may be fast for a human, but try putting that code into a webserver under high load :) These 50 deletions are O(n), that's right. But ask yourself: what does O(n) mean? Remember that n is O(n), as well as 100n, 10^5 n + log(n), ... – Andrea Corbellini Sep 03 '15 at 12:37
  • @Andrea Corbellini Yes, but the difference is too big! It's a factor of 100. If a single deletion shifts all the elements, it should cost approximately the same time as creating those elements. And actually, the most important part of your message is "(not all, see the source code)", because it looks like an answer to my question (the question is answered if delete can avoid linear time, at least for some number of initial deletions). Is the Python list some kind of advanced structure, like a deque in C++, with tricky deletion logic that can shrink a single slice unless too many elements are deleted? – Vasily Sep 03 '15 at 12:38
  • You must also consider that when you write `[i for i in range(100000000)]` you are running a for-loop (which takes time, a lot of time compared to `list(range(100000000))`) and you are building a lot of `int` objects. – Andrea Corbellini Sep 03 '15 at 12:43
  • About the "(not all, see the source code)" part: I was referring to right-sided deletions (e.g. `del l[-1]`), which is not your case, sorry! – Andrea Corbellini Sep 03 '15 at 12:44
  • *"If single deletion shift all the elements - it has to cost approximately the same time as creating these elements"* - that is simply not an accurate assumption. *"Is Python list some kind of advanced structure, like deque in C++"* - no, just an array of references to objects, see e.g. http://stackoverflow.com/q/3917574/3001761. – jonrsharpe Sep 03 '15 at 12:56
  • @Andrea Corbellini Yeah, I didn't know I could do that; I've just started to learn Python. But I've tried creating it the second way, and it seems to take the same time for both operations. Probably the result of some optimization. And about deletion: yes, I see. Then it's still an open question. – Vasily Sep 03 '15 at 12:58
  • @jonrsharpe Thanks for the link. I can measure the execution time of my code and experiment with it! It looks like it costs 43 ms to delete the first element of the list (and the cost decreases linearly towards zero as we delete elements closer to the end). OK, I see that the deletion logic is straightforward, just a shift. So, is that the answer? It's not that deletion is so quick, it's that creation is so slow? According to my measurements, creating the list needs 71 times more time per element than the shift (deleting the 1st element)... – Vasily Sep 03 '15 at 13:16
  • It's still not fully clear to me, anyway. How can the creation be 71 times slower? Based on my C++ experience, initializing an array costs approximately the same as shifting it. Does working with referenced objects really have such a big overhead? Yes, there has to be some, but not two orders of magnitude! – Vasily Sep 03 '15 at 13:19

2 Answers


I decided to turn my set of comments into a proper answer.

First off, let's clarify what's happening when you do:

>>> l = [i for i in range(100000000)]

Here three things are happening:

  1. 100000000 int objects are being created. Creating an object in CPython requires allocating memory and placing content into that memory, and this takes time.
  2. You are running a loop. This affects performance considerably: [i for i in range(...)] is much slower than list(range(...)) (see the sketch after this list).
  3. The large list is being created on the fly.
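
To see point 2 in isolation, you can compare the comprehension with list(range(...)) directly; a rough sketch (with a smaller size so it runs quickly):

import timeit

# The comprehension runs a Python-level loop and appends element by
# element; list(range(...)) fills the list in C.
n = 10000000
comp = timeit.timeit("[i for i in range(%d)]" % n, number=1)
direct = timeit.timeit("list(range(%d))" % n, number=1)
print("comprehension: %.2f s   list(range): %.2f s" % (comp, direct))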

Reading your question, it seems that you are considering only the last point, ignoring the others. Therefore, your timings are inaccurate: creating a large list does not take 3 seconds, it takes a fraction of those 3 seconds.

How large this fraction is, is an interesting question. It is difficult to answer using only Python code, but we can still try. Specifically, I'd try the following statement:

>>> [None] * 100000000

Here CPython does not have to create a large number of objects (there's only one None, referenced 100000000 times), does not have to run a loop, and can allocate the memory for the list in one go (because it knows the size in advance).
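
(A side note: the "allocate once" part can also be observed from pure Python by comparing the memory footprint of a preallocated list with one grown by repeated appends; sys.getsizeof reports the list header plus its pointer array, not the referenced objects. This is only an illustrative sketch.)

import sys

n = 1000000

preallocated = [None] * n      # pointer array allocated in one go

grown = []
for _ in range(n):             # grown by repeated appends, which
    grown.append(None)         # triggers periodic resizes on the way up

print(sys.getsizeof(preallocated))  # just the header plus n pointers
print(sys.getsizeof(grown))         # typically a bit larger: overallocation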

Timings are self-explanatory:

$ python3 -m timeit "list(range(100000000))"
10 loops, best of 3: 2.26 sec per loop
$ python3 -m timeit "[None] * 100000000"
10 loops, best of 3: 375 msec per loop

Now, back to your question: how about item deletion?

$ python3 -m timeit --setup "l = [None] * 100000000" "del l[0]"
10 loops, best of 3: 89 msec per loop
$ python3 -m timeit --setup "l = [None] * 100000000" "del l[100000000 // 4]"
10 loops, best of 3: 66.5 msec per loop
$ python3 -m timeit --setup "l = [None] * 100000000" "del l[100000000 // 2]"
10 loops, best of 3: 45.3 msec per loop

These numbers tell us something important. Note that 2 × 45.3 ≈ 89. Also 66.5 × 4 / 3 ≈ 89.

These numbers show exactly what linear complexity is about. If a function has time complexity k·n (which is O(n)), then doubling the input size doubles the time, and increasing the input size by a factor of 4/3 increases the time by a factor of 4/3.

And this is what's happening here. In CPython, our list of 100000000 items is a contiguous memory area containing pointers to Python objects:

l = |ptr0|ptr1|ptr2|...|ptr99999999|

When we run del l[0] we move ptr1 one slot to the left, overwriting ptr0, and then do the same for all the following pointers:

l = |ptr0|ptr1|ptr2|...|ptr99999999|
     ^^^^
         ` item to delete

l = |ptr1|ptr2|...|ptr99999999|

Therefore, when we run del l[0] we have to move 99999999 pointers to the left. This is different from del l[100000000 // 2], which requires moving only half of the pointers (the pointers in the first half don't need to be moved). "Moving half the pointers" equals "performing half of the operations", which roughly means "running in half the time" (this is not always true, but the timings say it is true in this case).
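
If it helps to see that shift spelled out, here is a rough pure-Python model of what del l[i] does (CPython performs the same shift with a single memmove over the pointer array in listobject.c, so the real thing is much faster per element, but the amount of work is the same):

# Illustrative only: model the left-shift that deleting index i implies.
def delete_at(lst, i):
    # every element after position i moves one slot to the left
    for j in range(i, len(lst) - 1):
        lst[j] = lst[j + 1]
    # drop the now-duplicated last slot (cheap in CPython)
    lst.pop()

l = list(range(10))
delete_at(l, 3)
print(l)   # [0, 1, 2, 4, 5, 6, 7, 8, 9]

The number of copies is len(l) - i - 1, which is why deleting near the front costs roughly twice as much as deleting from the middle.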

Andrea Corbellini
  • So, allocating objects is 6 times slower than allocating references, and allocating and initializing references (with Nones) is 8 times slower than shifting them. Well, I think I can live with that, although it seems like quite an overhead to me. – Vasily Sep 03 '15 at 13:38
  • In addition to that, reading the code, as suggested by @jonrsharpe reveals that the list is created [bit by bit](https://hg.python.org/cpython/file/tip/Objects/listobject.c#l47) whereas the [deletion](https://hg.python.org/cpython/file/tip/Objects/listobject.c#l572) mainly uses `memmove`. – 301_Moved_Permanently Sep 03 '15 at 13:38

I'm not sure why you think it should take 3 seconds to delete a single element.

Your initial time is for 100000000 individual append operations. Each of those takes a fraction of a second; your delete operation takes a similar amount of time.

In any case, as Bartosz points out, O(n) complexity doesn't mean that all operations take the same length of time; it means that the time is proportional to the length of the list.
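
To see that proportionality directly, you can time the same deletion on lists of different lengths; a quick sketch (sizes chosen arbitrarily so it runs fast):

import timeit

# Time a single "del l[0]" for lists of increasing length: the time
# grows roughly in proportion to the length.
for n in (1000000, 2000000, 4000000):
    t = timeit.timeit("del l[0]", setup="l = [None] * %d" % n, number=1)
    print("n = %8d: %6.2f ms" % (n, t * 1000))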

Daniel Roseman
  • It's worth noting that the underlying object will be resized on the way up, but not on the way back down (until it gets to half full), which adds more overhead. – jonrsharpe Sep 03 '15 at 11:29
  • So, your answer is "there is no complexity issue, it's just the same complexity with different per-operation times"? Yes, I agree, my initial time is 100M × time(append). But when I remove an element from the middle of the array, I have to shift 50M elements, or I lose the ability to access elements in constant time. So the time should be 50M × time(assign). Both operations are cache-friendly. I can't imagine why assigning an element the counter value should be 50 times slower than assigning it its neighbour's value. – Vasily Sep 03 '15 at 11:46
  • @Vasily but that's not all that's happening. The removal is simply shifting existing references in a contiguous chunk of memory, whereas the list creation will also involve creating the objects to be referenced by the list and resizing the list to fit them all in. – jonrsharpe Sep 03 '15 at 12:37
  • @jonrsharpe Yes, that's an interesting thought. I assumed the time needed to create the referenced objects was negligible. But now I've tried first initializing the whole list with zeroes and then assigning 1s to the elements; it's considerably faster (about 0.3 sec), although still noticeably slower than deletion. It looks like the answer may be more complex than I expected: not just tricky deletion logic, but also Python's memory management. – Vasily Sep 03 '15 at 12:49