Your complexity analysis is correct: `n` operations to compute the sum plus `n` operations to compute the list in both cases makes O(n).
But before we talk about speed, you have surely noticed that `almost_all_b` has a side effect whereas `almost_all_a` is side-effect free. Worse, `almost_all_b` is not idempotent: if you call `almost_all_b` repeatedly, the argument `numbers` is modified each time. Unless you have a very good reason, you should prefer `almost_all_a` to `almost_all_b`, since it is easier to understand and less error prone.
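For reference, here is a minimal reconstruction of the two functions, consistent with the bytecode disassembled below (the exact bodies come from the question, so treat these as assumptions):

```python
def almost_all_a(numbers):
    """Return a new list; `numbers` is left untouched."""
    sumall = sum(numbers)
    rangelen = range(len(numbers))
    return [sumall - numbers[i] for i in rangelen]

def almost_all_b(numbers):
    """Overwrite `numbers` in place and return None."""
    sumall = sum(numbers)
    for i in range(len(numbers)):
        numbers[i] = sumall - numbers[i]

ns = [1, 2, 3]
print(almost_all_a(ns), ns)  # [5, 4, 3] [1, 2, 3] -- argument intact
print(almost_all_b(ns), ns)  # None [5, 4, 3]      -- argument clobbered
```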
Benchmark 1
I will try to confirm your assertion (`almost_all_a` [is] more efficient than `almost_all_b`) with `timeit`:
>>> from timeit import timeit
>>> ns=list(range(100))
>>> timeit(lambda: almost_all_a(ns), number=10000)
0.06381335399782984
>>> timeit(lambda: almost_all_b(ns), number=10000)
2.3228586789991823
Wow! `almost_all_a` is roughly 35 times faster than `almost_all_b`!!! No. That was a joke. You can see what happened: `almost_all_b` was applied 10000 times to `[0, ..., 99]` with the side effect, hence the numbers grew insanely:
>>> len(str(ns[0])) # number of digits of the first element!
19959
Ok, that was just to convince you to avoid functions with side effects.
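To see how fast the values explode under repeated calls, here is a small sketch, assuming an in-place implementation like this one (a reconstruction, not the original code):

```python
def almost_all_b(numbers):
    # Hypothetical in-place version from the question: overwrites its argument.
    sumall = sum(numbers)
    for i in range(len(numbers)):
        numbers[i] = sumall - numbers[i]

ns = list(range(100))
for _ in range(20):
    almost_all_b(ns)

# Each call roughly multiplies the magnitudes by the list length,
# so the digit count grows linearly with the number of calls.
print(len(str(max(ns, key=abs))))
```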
Benchmark 2
Now, the real test. The list is rebuilt inside the timed statement, so that the side effect of `almost_all_b` does not skew subsequent iterations:
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_a})
5.720672591000039
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_b})
5.937547881
Note that the benchmark might give different results with another list or on another platform. (Think of what will happen if the list takes 90% of the available RAM.) But let's assume that we can generalize.
Python bytecode
Let's look at the bytecode:
>>> import dis
>>> dis.dis(almost_all_a)
2 0 LOAD_GLOBAL 0 (sum)
2 LOAD_DEREF 0 (numbers)
4 CALL_FUNCTION 1
6 STORE_DEREF 1 (sumall)
3 8 LOAD_GLOBAL 1 (range)
10 LOAD_GLOBAL 2 (len)
12 LOAD_DEREF 0 (numbers)
14 CALL_FUNCTION 1
16 CALL_FUNCTION 1
18 STORE_FAST 1 (rangelen)
4 20 LOAD_CLOSURE 0 (numbers)
22 LOAD_CLOSURE 1 (sumall)
24 BUILD_TUPLE 2
26 LOAD_CONST 1 (<code object <listcomp> at 0x7fdc551dee40, file "<stdin>", line 4>)
28 LOAD_CONST 2 ('almost_all_a.<locals>.<listcomp>')
30 MAKE_FUNCTION 8
32 LOAD_FAST 1 (rangelen)
34 GET_ITER
36 CALL_FUNCTION 1
38 RETURN_VALUE
And:
>>> dis.dis(almost_all_b)
2 0 LOAD_GLOBAL 0 (sum)
2 LOAD_FAST 0 (numbers)
4 CALL_FUNCTION 1
6 STORE_FAST 1 (sumall)
3 8 SETUP_LOOP 36 (to 46)
10 LOAD_GLOBAL 1 (range)
12 LOAD_GLOBAL 2 (len)
14 LOAD_FAST 0 (numbers)
16 CALL_FUNCTION 1
18 CALL_FUNCTION 1
20 GET_ITER
>> 22 FOR_ITER 20 (to 44)
24 STORE_FAST 2 (i)
4 26 LOAD_FAST 1 (sumall)
28 LOAD_FAST 0 (numbers)
30 LOAD_FAST 2 (i)
32 BINARY_SUBSCR
34 BINARY_SUBTRACT
36 LOAD_FAST 0 (numbers)
38 LOAD_FAST 2 (i)
40 STORE_SUBSCR
42 JUMP_ABSOLUTE 22
>> 44 POP_BLOCK
>> 46 LOAD_CONST 0 (None)
48 RETURN_VALUE
The beginning is almost the same. Then, in `almost_all_a`, you have a list comprehension that is like a black box. If we open the box, we see:
>>> dis.dis(almost_all_a.__code__.co_consts[1])
4 0 BUILD_LIST 0
2 LOAD_FAST 0 (.0)
>> 4 FOR_ITER 16 (to 22)
6 STORE_FAST 1 (i)
8 LOAD_DEREF 1 (sumall)
10 LOAD_DEREF 0 (numbers)
12 LOAD_FAST 1 (i)
14 BINARY_SUBSCR
16 BINARY_SUBTRACT
18 LIST_APPEND 2
20 JUMP_ABSOLUTE 4
>> 22 RETURN_VALUE
You have two differences:

- in the list comprehension, `sumall` and `numbers` are loaded with `LOAD_DEREF` and not `LOAD_FAST` (that's normal for a closure), which should be a little slower;
- in the list comprehension, `LIST_APPEND` replaces the assignment to `numbers[i]` (`LOAD_FAST(numbers)`/`LOAD_FAST(i)`/`STORE_SUBSCR` at offsets 36-40).

My guess is that the overhead comes from that assignment.
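That guess can be probed with a crude micro-benchmark of my own (an assumption, not taken from the original functions): time a loop that assigns into an existing list by index against a comprehension that only appends.

```python
from timeit import timeit

setup = "ns = list(range(100)); out = [0] * len(ns)"

# Index assignment: LOAD_FAST(out)/LOAD_FAST(i)/STORE_SUBSCR per item,
# plus a BINARY_SUBSCR to read ns[i].
store = "for i in range(len(ns)):\n    out[i] = ns[i]"

# Comprehension: a single LIST_APPEND per item, no subscripting by index.
append = "[n for n in ns]"

t_store = timeit(store, setup=setup, number=20_000)
t_append = timeit(append, setup=setup, number=20_000)
print(t_store, t_append)  # the append version should come out ahead
```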
Another benchmark
You can rewrite `almost_all_a` to be even neater, because you don't need the index:
def almost_all_c(numbers):
sumall = sum(numbers)
return [sumall - n for n in numbers]
This version is faster (on my example + platform) than `almost_all_a`:
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_a})
5.755438814000172
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_b})
5.93645353099987
>>> timeit('ns=list(range(100));almost_all(ns)', globals={'almost_all':almost_all_c})
4.571863283000084
(Note that, as often in Python, the neater version is the fastest.) The difference between `almost_all_a` and `almost_all_c` is the access to the `i`-th item of `numbers` (you can decompile `almost_all_c` if you want to check).
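For instance, a quick sketch of that check (the exact opcode listing varies between CPython versions; the dumps above look like 3.7):

```python
import dis

def almost_all_c(numbers):
    sumall = sum(numbers)
    return [sumall - n for n in numbers]

# Since Python 3.7, dis.dis recurses into nested code objects, so this also
# prints the comprehension's bytecode: no BINARY_SUBSCR appears, because the
# loop variable is the item itself rather than an index into `numbers`.
dis.dis(almost_all_c)
```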
Conclusion
It seems that the bottleneck here is the access to the `i`-th item of `numbers`:

- one time in `almost_all_a`;
- two times in `almost_all_b`;
- never in `almost_all_c`.

That's why `almost_all_a` is faster than `almost_all_b`, and why `almost_all_c` beats both.