I'm having some difficulty understanding (and ultimately solving) why having a large dictionary in memory slows down the creation of other dictionaries.
Here's the test code that I'm using:
import time

def create_dict():
    # return {x: [x] * 125 for x in xrange(0, 100000)}
    return {x: (x,) * 125 for x in xrange(0, 100000)}  # UPDATED: use tuples instead of lists of values

class Foo(object):
    @staticmethod
    def dict_init():
        start = time.clock()
        Foo.sample_dict = create_dict()
        print "dict_init in Foo took {0} sec".format(time.clock() - start)

if __name__ == '__main__':
    Foo.dict_init()
    for x in xrange(0, 10):
        start = time.clock()
        create_dict()
        print "Run {0} took {1} seconds".format(x, time.clock() - start)
If I run the code as is (first initializing sample_dict in Foo) and then create the same dictionary 10 more times in the loop, I get the following results:
dict_init in Foo took 0.385263764287 sec
Run 0 took 0.548807949139 seconds
Run 1 took 0.533209452471 seconds
Run 2 took 0.51916067636 seconds
Run 3 took 0.513130722575 seconds
Run 4 took 0.508272050029 seconds
Run 5 took 0.502263872177 seconds
Run 6 took 0.48867601998 seconds
Run 7 took 0.483109299676 seconds
Run 8 took 0.479019713488 seconds
Run 9 took 0.473174195256 seconds
[Finished in 5.6s]
If, however, I do NOT initialize sample_dict in Foo (by commenting out Foo.dict_init()), dictionary creation in the loop is almost 20% faster:
Run 0 took 0.431378921359 seconds
Run 1 took 0.423696636179 seconds
Run 2 took 0.419630475616 seconds
Run 3 took 0.405130343806 seconds
Run 4 took 0.398099686921 seconds
Run 5 took 0.392837169802 seconds
Run 6 took 0.38799598399 seconds
Run 7 took 0.375133006408 seconds
Run 8 took 0.368755297573 seconds
Run 9 took 0.363273701371 seconds
[Finished in 4.0s]
I've noticed that if I turn OFF Python's garbage collector by calling gc.disable(), performance not only improves ~5x, but storing the large dictionary in Foo no longer makes a difference. Here are the results with garbage collection disabled:
dict_init in Foo took 0.0696136982496 sec
Run 0 took 0.113533445358 seconds
Run 1 took 0.111091241489 seconds
Run 2 took 0.111151620212 seconds
Run 3 took 0.110655722831 seconds
Run 4 took 0.111807537706 seconds
Run 5 took 0.11097510318 seconds
Run 6 took 0.110936170451 seconds
Run 7 took 0.111074414632 seconds
Run 8 took 0.110678488579 seconds
Run 9 took 0.111011066463 seconds
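For reference, here's roughly how I took the gc-disabled timings; this sketch assumes the create_dict() defined above and re-enables collection afterwards so the rest of the program still collects cycles:

import gc
import time

gc.disable()  # suspend cyclic garbage collection during the bulk allocations
try:
    for x in xrange(0, 10):
        start = time.clock()
        create_dict()
        print "Run {0} took {1} seconds".format(x, time.clock() - start)
finally:
    gc.enable()  # restore normal collection once the hot loop is done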
So I have 2 questions:
- Why does disabling garbage collection speed up dictionary creation?
- How can I achieve even performance (with the Foo init and without) without disabling garbage collection (see the threshold sketch below)?
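For what it's worth, the only alternative I'm aware of so far is tuning the collector rather than disabling it, e.g. raising the generation-0 threshold so collections run less often during bulk allocation. A sketch; the 100000 value is just an illustrative guess, not something I've benchmarked:

import gc

print gc.get_threshold()          # defaults are (700, 10, 10)
gc.set_threshold(100000, 10, 10)  # collect gen 0 only after ~100000 net container allocations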
I would appreciate any insight on this.
Thank you!
UPDATED: After Tim Peters mentioned that I'm creating mutable objects, I decided to try creating immutable objects (tuples in my case) instead, and voila: much, much faster results (the same with gc enabled and with it disabled):
dict_init in Foo took 0.017769 sec
Run 0 took 0.017547 seconds
Run 1 took 0.013234 seconds
Run 2 took 0.012791 seconds
Run 3 took 0.013371 seconds
Run 4 took 0.013288 seconds
Run 5 took 0.013692 seconds
Run 6 took 0.013059 seconds
Run 7 took 0.013311 seconds
Run 8 took 0.013343 seconds
Run 9 took 0.013675 seconds
I understand that tuple creation is much faster than list creation, but why does having a dictionary of immutable objects not affect the time spent by the garbage collector? Are immutable objects never involved in reference cycles?
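A quick way to observe the difference (a sketch assuming CPython 2.7, where gc.is_tracked() is available): both objects start out tracked, but the first collection untracks a tuple that contains only atomic values, while the equivalent list stays tracked forever:

import gc

n = 125  # computed length, so the peephole optimizer can't constant-fold the tuple
t = (1,) * n
l = [1] * n
print gc.is_tracked(t), gc.is_tracked(l)  # True True: both tracked at creation
gc.collect()  # a full collection untracks tuples that can never be part of a cycle
print gc.is_tracked(t), gc.is_tracked(l)  # False True: the list remains tracked

If that's right, it would explain why the tuple version no longer pays a per-collection cost: after the first collection the collector has almost nothing left to traverse.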
Thank you.
P.S. As it happens, in my real-world scenario, converting the lists to tuples solved the problem (there was never a need for lists; I just didn't think of using tuples), but I'm still curious as to why it's faster.