
I am curious why removing a line in my code results in a significant increase in performance. The function itself takes a dictionary and removes all keys which are substrings of other keys.
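For example, with a small made-up input, 'AT' and 'ATG' should be removed because both occur inside 'ATGC':

d = {'AT': 1, 'ATG': 2, 'ATGC': 3, 'GG': 4}
# reduced(d) should return {'ATGC': 3, 'GG': 4}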

The line which slows my code down is:

if sub in reduced_dict and sub2 in reduced_dict:

Here's my function:

import time
from collections import defaultdict

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub in reduced_dict and sub2 in reduced_dict: # Removing this line gives a significant performance boost
                            if sub in sub2:
                                reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict

The function checks whether sub is in sub2 many times. I assumed that by first checking that both keys were still in the dictionary, I would skip comparisons involving keys that had already been removed and save myself time. This doesn't seem to be the case. Why is the constant-time dictionary lookup slowing me down?

I am a beginner, so I'm interested in the concepts.
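To get a feel for what the lookup itself costs, here is a minimal timeit sketch I could use (the dictionary and strings below are made up for illustration, not my real data):

import timeit

setup = "d = {str(i): i for i in range(15000)}; s = '12345'; t = '1234567890'"
print timeit.timeit("s in d", setup=setup, number=1000000)  # hash the key, then probe the dict
print timeit.timeit("s in t", setup=setup, number=1000000)  # scan one short string for another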

When I tested whether the line in question ever returns False, it appears that it does. I've tested this with the following:

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            print 'not present' # This line prints many thousands of times
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict

For 14,805 keys in the function's input dictionary:

  • 19.6360001564 sec. without the line
  • 33.1449999809 sec. with the line

Here are 3 dictionary examples: the biggest sample dictionary with 14,805 keys, a medium sample dictionary, and a smaller sample dictionary.

I have graphed time in seconds (Y) vs input size in # of keys (X) for the first 14,000 keys in the biggest example dictionary. It appears all these functions have exponential complexity.

  • John Zwinck: answer to this question
  • Matt: my algorithm for this question without the dictionary comparison
  • Matt exponential: my first attempt at this problem (this took 76 s)
  • Matt compare: the algorithm in this question with the dict comparison line
  • tdelaney: solutions for this question (algorithms 1 & 2, in order)
  • georg: solution from a related question I asked

[Graph: run time in seconds (Y) vs. number of input keys (X) for each algorithm listed above]

The accepted answer executes in apparently linear time.


I'm surprised to find that a magic ratio exists for input size where the run time of a dict look-up equals that of a string search.

  • Can you include the sample dicts you're using to test this? Also, using `time.time()` to measure this isn't usually sufficiently accurate. You should use the `timeit` module instead. – dano Sep 13 '14 at 02:49
  • Here are 2 dictionary examples. The first is longer http://justpaste.it/festival_dict and the second is shorter http://justpaste.it/ATGC – 12345678910111213 Sep 13 '14 at 03:01
  • Your code is probably slow more due to the quadruple-nested "for" loops than anything else. :) – John Zwinck Sep 13 '14 at 03:07
  • @John Zwinck The above code runs much faster than a solution in a previous question of mine while trying to solve the same issue. http://stackoverflow.com/a/25422647/3761932 I will try to tackle the nesting but that's another question. – 12345678910111213 Sep 13 '14 at 03:13
  • You describe an algorithm that operates on the keys, but your code operates on the values. Is that just a typo? – Marcelo Cantos Sep 13 '14 at 03:17
  • @Marcelo Cantos The keys which are being examined as substrings are from the input dictionary called 'dictionary'. These keys are then inserted as values into another dictionary called len_dict. The purpose of len_dict is to put all keys together which have the same length. This way I will only search for keys which are substrings in other keys with a greater length. – 12345678910111213 Sep 13 '14 at 03:20
  • @mattkaeo: Indeed. I should read more carefully. – Marcelo Cantos Sep 13 '14 at 03:46

3 Answers


For the sample corpus, or any corpus in which most keys are small, it's much faster to test all possible subkeys:

def reduced(dictionary):
    keys = set(dictionary.iterkeys())
    subkeys = set()
    for key in keys:
        # generate every proper substring of key and record the ones
        # that are themselves keys in the dictionary
        for n in range(1, len(key)):
            for i in range(len(key) + 1 - n):
                subkey = key[i:i+n]
                if subkey in keys:
                    subkeys.add(subkey)

    return {k: v
            for (k, v) in dictionary.iteritems()
            if k not in subkeys}

This takes about 0.2s on my system (i7-3720QM 2.6GHz).
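As a quick sanity check on a tiny made-up dictionary (not one of the question's samples):

d = {'AT': 1, 'ATG': 2, 'ATGC': 3, 'GG': 4}
print reduced(d)  # {'ATGC': 3, 'GG': 4} -- 'AT' and 'ATG' occur inside 'ATGC'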

Marcelo Cantos
  • This is the fastest algorithm. I'm graphing it over 103,000 keys. I'll add this graph to the question. There appears to be a magic ratio for input size where a dict look-up == a string search. – 12345678910111213 Sep 14 '14 at 05:14

You create len_dict, but even though it groups keys of equal size, you still have to traverse everything multiple times to compare. Your basic plan is right - sort by size and only compare against what's the same size or bigger - but there are other ways to do that. Below, I just created a regular list sorted by key size and then iterated backwards so that I could trim the list as I went. I'm curious how its execution time compares to yours. It did your little dict example in 0.049 seconds.

(I hope it actually worked!)

def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))    # shortest keys first
    for i in range(len(items)-2, -1, -1):  # walk backwards so del never shifts unvisited entries
        k = items[i][0]
        for k_fwd, v_fwd in items[i+1:]:   # only keys at least as long as k
            if k in k_fwd:
                del items[i]
                break
    return dict(items)

EDIT

A significant speed increase by not unpacking k_fwd, v_fwd (after running both a few times, this wasn't really a speed-up; something else must have been eating time on my PC for a while).

def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2,-1,-1):
        k = items[i][0]
        for kv_fwd in items[i+1:]:
            if k in kv_fwd[0]:
                del items[i]
                break
    return dict(items)
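If you want to time it with the timeit module (as suggested in the question's comments), something along these lines should work; sample_dict here is just a placeholder name for whichever of the posted sample dictionaries you load:

import timeit

# sample_dict is a placeholder: load one of the sample dictionaries into it first
print timeit.timeit(lambda: myfilter(sample_dict), number=5) / 5  # average seconds per call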
tdelaney
  • Your function is faster. For 14,805 keys, my function without the problematic line ran in 17.75 sec. Yours in 12.3139998913 sec. Your function returned 7087 keys & mine 7086, which is interesting. I like your approach. Do you have any insight why a dictionary lookup extended my function's runtime? – 12345678910111213 Sep 13 '14 at 03:57
  • @mattkaeo dictionary lookups aren't *free*, they're just (pseudo) constant time. You're doing millions of lookups due to how your code is structured. – roippi Sep 13 '14 at 04:04
  • 1. Your 2 for loops for len_dict.items() mean that you rebuild items in that second loop len(len_dict) times, and throw most of them away in the if right after. The remaining dict lookups are fairly expensive compared to an indexed lookup on a list (a rough timeit sketch of that difference is below these comments). In truth, I thought I'd beat you by a lot more than that. Now I'm puzzled... – tdelaney Sep 13 '14 at 04:05
  • @tdelaney That's interesting. I didn't know dictionary lookups could be more expensive than index lookups from a list. Here's the much larger 14805 key length dictionary I've been using for experimentation. http://justpaste.it/14805keys – 12345678910111213 Sep 13 '14 at 04:13
  • @mattkaeo - that took 12.029 seconds on my machine. i'm doing a bit of tweaking... – tdelaney Sep 13 '14 at 04:19
  • oops, i read the results from a dict i generated w/ 22056 items. your 14805 set was 4.00 seconds. Interesting the increased time with the larger set - ain't exponential growth grand! – tdelaney Sep 13 '14 at 04:26
  • I had a bug in range() - it stopped at 1 not 0 so I returned one too many items. Also, I stopped unpacking k_fwd,v_fwd and had a big improvement to 4.34 seconds. Editing the answer now. – tdelaney Sep 13 '14 at 04:34
  • @tdelaney Only 4 seconds? I'm curious why I'm not seeing this same result. It appears your first algorithm ran faster for me. I wanted to see how each of these algorithms behaved as the input size increased. I've updated the question with a graph for my results. – 12345678910111213 Sep 13 '14 at 07:13
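A rough timeit sketch of the list-indexing-versus-dict-lookup point from the comments above (with made-up data, just to isolate the two operations):

import timeit

setup = "d = dict.fromkeys(str(i) for i in range(15000)); lst = [str(i) for i in range(15000)]"
print timeit.timeit("'12345' in d", setup=setup, number=1000000)  # hash the string, then probe the dict
print timeit.timeit("lst[12345]", setup=setup, number=1000000)    # plain indexed access into a list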

I would do it a bit differently. Here's a generator function which gives you the "good" keys only. This avoids creating a dict which may be largely destroyed key-by-key. I also have just two levels of "for" loops and some simple optimizations to try to find matches more quickly and avoid searching for impossible matches.

def reduced_keys(dictionary):
    keys = dictionary.keys()
    keys.sort(key=len, reverse=True) # longest first for max hit chance
    for key1 in keys:
        found_in_key2 = False
        for key2 in keys:
            if len(key2) <= len(key1): # no more keys are long enough to match
                break
            if key1 in key2:
                found_in_key2 = True
                break
        if not found_in_key2:
            yield key1

If you want to make an actual dict using this, you can:

{ key: d[key] for key in reduced_keys(d) }
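For instance, on a small made-up dictionary:

d = {'AT': 1, 'ATG': 2, 'ATGC': 3, 'GG': 4}
print list(reduced_keys(d))                      # ['ATGC', 'GG']
print {key: d[key] for key in reduced_keys(d)}   # {'ATGC': 3, 'GG': 4}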
John Zwinck
  • For my 14,805 key dictionary justpaste.it/14805keys yours = 27.7s with the same 7086 keys as my function. Mine = 17.7s. I'll try to time this in a different way. What is the most expensive part of your generator? Is it the length comparison or is it populating the dictionary? – 12345678910111213 Sep 13 '14 at 04:56
  • Interesting. We could cache the length of key1 outside the inner loop. And this algorithm should be easy to port to C if you want much higher speed (yet still usable from Python). I'd be interested to know how fast it is if you sort in non-reverse order and take out the check for key lengths completely. – John Zwinck Sep 13 '14 at 10:54