Background
I've been looking into a few ways to do essentially the same thing and would like some input on their time complexity, on whether my results agree with the python wiki, and on why the general differences in speed exist. I'm using Python 3.7 (CPython) for my tests.
More information on time complexity here and here.
Testable Code
import timeit

def test1():
    new_d = {}
    for i in range(0, 50000):
        new_d[i] = None

def test2():
    new_s = set()
    for i in range(0, 50000):
        new_s.add(i)

def test3():
    new_l = list(range(0, 50000))
    new_s = set(new_l)

def test3b():
    new_s = set(range(0, 50000))

def test4():
    new_l = list(range(0, 50000))
    new_d = dict.fromkeys(new_l)

def test4b():
    new_d = dict.fromkeys(range(0, 50000))

def test5():
    new_l = list(range(0, 50000))
    new_d = {key: None for key in new_l}

def test5b():
    new_d = {key: None for key in range(0, 50000)}

tests = [test1, test2, test3, test3b, test4, test4b, test5, test5b]

for test in tests:
    print(timeit.timeit(test, number=10000))
Reasoning
So above are five approaches (eight test functions, counting the b variants) that do what seem to be very similar operations. The idea behind all of them is to take a range and add its values to a hash-based container (set or dict) in order to get O(1) lookups -- see the short membership-test sketch after the list below.
test1: Create a dict -> iterate range -> add to dict
test2: Create a set -> iterate range -> add to set
test3: List function on range -> [automatic] -> set function on list
test3b: [automatic] -> [automatic] -> set function on range
test4: List function on range -> [automatic] -> dict from keys on list
test4b: [automatic] -> [automatic] -> dict from keys on range
test5: List function on range -> [automatic] -> dict comprehension on list
test5b: [automatic] -> [automatic] -> dict comprehension on range
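To be concrete about the O(1)-lookup goal, here is a minimal sketch of the kind of membership test I have in mind (the size and the value looked up are arbitrary):

import timeit

n = 50000
as_list = list(range(n))
as_set = set(range(n))

# membership in a list scans elements (O(n) on average);
# membership in a set is a hash lookup (O(1) on average)
print(timeit.timeit(lambda: (n - 1) in as_list, number=1000))
print(timeit.timeit(lambda: (n - 1) in as_set, number=1000))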
Now to me (not an expert), the general logic behind all of these is very similar: iterate the range and add each value to a hash table. If anything, I would guess that test1 and test2 would complete before the others, because they don't create an intermediate list and then iterate that list again (two O(n) passes) and should complete in a single pass. I would also guess that each b function would be faster than its non-b counterpart by an O(n) amount, since it skips the list step entirely (a quick sketch of the difference follows).
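To illustrate why I thought the intermediate list should matter: range is lazy, while list(range(...)) materializes every value up front. A small sketch of the memory footprint difference (exact byte counts will vary by platform and build):

import sys

n = 50000
r = range(n)   # lazy: values are produced on demand
l = list(r)    # eager: all n values are materialized up front

print(sys.getsizeof(r))   # a small, constant-size range object
print(sys.getsizeof(l))   # roughly n pointers' worth of list storage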
And then last, I would guess that test1 would be nearly the same as test4b (and test4, test5, and test5b), and that test2 would be nearly the same as test3b (and test3).
Results
But alas, I'm here, so my assumptions are incorrect.
Methods: I tested ranges of 1000 (average of 3 runs), 2000 (average of 3), 3000 (average of 3), 50000 (average of 2), and 100000 (single run), all with timeit.timeit(number=10000).
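For reference, a rough sketch of the kind of loop that produces these numbers -- the test functions above hardcode 50000, so the make_test helper below is just an illustrative parameterized version, not the exact code I ran:

import timeit

def make_test(n):
    # parameterized version of test3b above
    def test3b():
        new_s = set(range(0, n))
    return test3b

for n in (1000, 2000, 3000, 50000, 100000):
    total = timeit.timeit(make_test(n), number=10000)
    print(n, total, total * 1000 / n)   # total seconds, then seconds per 1000 elements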
TOTAL TIME in seconds (truncated)

n        test1     test2     test3     test3b    test4     test4b    test5     test5b
1000     0.5094    0.7135    0.2758    0.2377    0.3735    0.3424    0.5173    0.4591
2000     1.0769    1.4383    0.6109    0.5161    0.7880    0.7226    1.0211    0.9565
3000     1.7337    2.1061    0.8848    0.7470    1.3066    1.1763    1.6431    1.6390
50000    34.6455   37.0113   16.0184   19.6853   35.4731   32.4986   40.2696   41.2130
100000   84.9230   89.7245   55.3949   47.8195   82.1506   75.7063   94.4856   84.2978
ADJUSTED TIME PER 1000 ELEMENTS in seconds (rounded)

n        test1     test2     test3     test3b    test4     test4b    test5     test5b
1000     0.5094    0.7135    0.2758    0.2378    0.3736    0.3430    0.5174    0.4591
2000     0.5385    0.7192    0.3055    0.2581    0.3940    0.3613    0.5106    0.4783
3000     0.5779    0.7021    0.2950    0.2490    0.4355    0.3921    0.5477    0.5464
50000    0.6929    0.7402    0.3204    0.3937    0.7095    0.6500    0.8054    0.8243
100000   0.8492    0.8972    0.5539    0.4782    0.8215    0.7571    0.9449    0.8430
Observations and Questions
Fastest: test3b is nearly always the fastest, followed shortly by test3 (except at 50,000).
Slowest: test2 is the slowest at small sizes, then test5b at 50,000, and finally test5 at 100,000.
The time taken does not grow linearly with n; it grows faster than the change in n. Normalized to per 1000 elements, I expected to see much closer numbers, and if anything I expected the per-1000 time to reach a maximum and then level off. So, as I have already admitted, my assumptions and expectations are wrong.
Can you help me out with understanding this performance? It's clear that test3b is the fastest method by far and that test2 performs poorly in comparison, even though they do similar things. I imagine it has to do with memory management and the allocation of space for the hash table. My main question is: what's happening here, and why?
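For what it's worth, here is a small sketch of one way to watch that allocation happen, by printing each point where the container's reported size jumps as elements are added (the exact growth points are a CPython implementation detail, so treat the output as illustrative):

import sys

new_s = set()
last = sys.getsizeof(new_s)
for i in range(50000):
    new_s.add(i)
    size = sys.getsizeof(new_s)
    if size != last:
        # each print marks a point where the set's underlying hash table was reallocated
        print(i, last, '->', size)
        last = size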