I am curious why removing a line in my code results in a significant increase in performance. The function itself takes a dictionary and removes all keys which are substrings of other keys.
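For example (made-up input, just to illustrate the expected behaviour):

reduced({'ab': 1, 'abc': 2, 'cd': 3})  # -> {'abc': 2, 'cd': 3}; 'ab' is dropped because it is a substring of 'abc'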
The line which slows my code down is:
if sub in reduced_dict and sub2 in reduced_dict:
Here's my function:
from collections import defaultdict
import time

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    # group the keys by length
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    # compare each key only against strictly longer keys
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub in reduced_dict and sub2 in reduced_dict:  # Removing this line gives a significant performance boost
                            if sub in sub2:
                                reduced_dict.pop(sub, 0)  # default avoids KeyError if sub is already gone
    print time.time() - start_time
    return reduced_dict
The function checks whether sub is in sub2 many times. I assumed that by skipping pairs where one of the keys had already been removed, I would save myself time. That doesn't seem to be the case. Why is the supposedly constant-time dictionary lookup slowing me down? I'm a beginner, so I'm interested in the concepts.
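To get a feel for the raw per-operation cost, here is a rough timeit micro-benchmark I put together (the keys here are made-up stand-ins for data like mine, and the absolute numbers will vary by machine and Python version):

import timeit

setup = (
    "d = dict.fromkeys('key%d' % i for i in range(10000))\n"  # hypothetical dict of short string keys
    "a = 'key1234'\n"                                          # present in d
    "b = 'key5678901'\n"                                       # absent from d
)

# two dict membership tests (one hit, one miss), like the guard line in my loop
print(timeit.timeit("a in d and b in d", setup=setup, number=1000000))
# one substring search over a short string, like the inner test
print(timeit.timeit("a in b", setup=setup, number=1000000))

My expectation was that the two lookups per iteration would be negligible next to the substring test; this is meant to check whether that actually holds for short keys.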
When I tested whether the line in question ever evaluates to False, it turns out that it does. I verified this with the following:
def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            print 'not present'  # This line prints many thousands of times
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict
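Since printing many thousands of lines is itself slow and could skew the timing, a variant that only counts the misses might be a fairer test (same algorithm; only the diagnostic changes):

from collections import defaultdict
import time

def reduced_counting(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    misses = 0  # pairs where at least one key was already removed
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            misses += 1
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print(time.time() - start_time)
    print('%d pairs involved an already-removed key' % misses)
    return reduced_dict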
For 14,805 keys in the function's input dictionary:
- 19.6360001564 sec. without the line
- 33.1449999809 sec. with the line
Here are three sample dictionaries: the biggest one with 14,805 keys, a medium one, and a smaller one.
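In case the samples are unavailable, a hypothetical generator along these lines produces dictionaries with a similar shape (random short string keys; this is stand-in data, not my actual samples):

import random
import string

def make_sample(n, min_len=3, max_len=12):
    # random lowercase keys of mixed lengths; reduced() never looks at the values
    keys = set()
    while len(keys) < n:
        length = random.randint(min_len, max_len)
        keys.add(''.join(random.choice(string.ascii_lowercase)
                         for _ in range(length)))
    return dict.fromkeys(keys, 0)

big_sample = make_sample(14805)  # same key count as the biggest sample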
I have graphed time in seconds (Y) against input size in number of keys (X) for the first 14,000 keys of the biggest example dictionary. All of these functions appear to grow much faster than linearly; given the nested pairwise comparisons, the growth looks roughly quadratic rather than truly exponential.
- John Zwinck: the answer to this question
- Matt: my algorithm from this question without the dictionary-lookup line
- Matt exponential: my first attempt at this problem (it took 76 s)
- Matt compare: the algorithm in this question with the dictionary-lookup line
- tdelaney: the solution for this question (algorithms 1 and 2, in order)
- georg: the solution from a related question I asked
The accepted answer runs in apparently linear time.
I'm surprised to find that a "magic ratio" exists: an input size at which the run time of a dict look-up equals that of a string search.
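To see where such a crossover could come from, here is a sketch that times a single dict lookup against a single substring search as the string length grows (the lengths are arbitrary and the numbers machine-dependent):

import timeit

for n in (4, 40, 400, 4000):
    setup = (
        "s = 'x' * %d\n"
        "t = 'y' * %d + s\n"  # s occurs only at the very end of t
        "d = {s: None}\n"
    ) % (n, n)
    t_dict = timeit.timeit("s in d", setup=setup, number=100000)  # hash once (cached), then probe
    t_sub = timeit.timeit("s in t", setup=setup, number=100000)   # must scan t to confirm the match
    print('len %4d: dict %.4fs, substring %.4fs' % (n, t_dict, t_sub))

The dict lookup should stay roughly flat (CPython caches string hashes), while the substring search grows with the string length, so for short enough keys the substring test can be the cheaper of the two operations.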