
I am curious why removing a line in my code results in a significant increase in performance. The function itself takes a dictionary and removes all keys which are substrings of other keys.
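For example, with a small made-up input, 'AT' and 'ATG' should be removed because both occur inside 'ATGC':

d = {'AT': 1, 'ATG': 2, 'ATGC': 3, 'GG': 4}
# reduced(d) should return {'ATGC': 3, 'GG': 4}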

The line which slows my code down is:

if sub in reduced_dict and sub2 in reduced_dict:

Here's my function:

import time
from collections import defaultdict

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub in reduced_dict and sub2 in reduced_dict: # Removing this line gives a significant performance boost
                            if sub in sub2:
                                reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict

The function checks whether sub is in sub2 many times. I assumed that by first checking that both keys were still in the dictionary, I would skip comparisons involving keys that had already been removed and save myself time. This doesn't seem to be the case. Why is the constant-time dictionary lookup slowing me down?

I am a beginner, so I'm interested in the concepts.
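To get a feel for what the lookup itself costs, here is a minimal timeit sketch I could use (the dictionary and strings below are made up for illustration, not my real data):

import timeit

setup = "d = {str(i): i for i in range(15000)}; s = '12345'; t = '1234567890'"
print timeit.timeit("s in d", setup=setup, number=1000000)  # hash the key, then probe the dict
print timeit.timeit("s in t", setup=setup, number=1000000)  # scan one short string for another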

When I tested whether the line in question ever returns False, it appears that it does. I've tested this with the following:

def reduced(dictionary):
    reduced_dict = dictionary.copy()
    len_dict = defaultdict(list)
    for key in dictionary:
        len_dict[len(key)].append(key)
    start_time = time.time()
    for key, subs in len_dict.items():
        for key2, subs2 in len_dict.items():
            if key2 > key:
                for sub in subs:
                    for sub2 in subs2:
                        if sub not in reduced_dict or sub2 not in reduced_dict:
                            print 'not present' # This line prints many thousands of times
                        if sub in sub2:
                            reduced_dict.pop(sub, 0)
    print time.time() - start_time
    return reduced_dict

For 14,805 keys in the function's input dictionary:

  • 19.6360001564 sec. without the line
  • 33.1449999809 sec. with the line

Here are 3 dictionary examples: the biggest sample dictionary with 14,805 keys, a medium sample dictionary, and a smaller sample dictionary.

I have graphed time in seconds (Y) vs input size in # of keys (X) for the first 14,000 keys in the biggest example dictionary. It appears all these functions have exponential complexity.

  • John Zwinck: answer to this question
  • Matt: my algorithm for this question without the dictionary comparison
  • Matt exponential: my first attempt at this problem (this took 76 s)
  • Matt compare: the algorithm in this question with the dict comparison line
  • tdelaney: solutions for this question (algorithms 1 & 2, in order)
  • georg: solution from a related question I asked

[Graph: run time in seconds (Y) vs. number of input keys (X) for each algorithm listed above]

The accepted answer executes in apparently linear time.


I'm surprised to find that a magic ratio exists for input size where the run time of a dict look-up equals that of a string search.

  • Can you include the sample dicts you're using to test this? Also, using `time.time()` to measure this isn't usually sufficiently accurate. You should use the `timeit` module instead. – dano Sep 13 '14 at 02:49
  • Here are 2 dictionary examples. The first is longer http://justpaste.it/festival_dict and the second is shorter http://justpaste.it/ATGC – 12345678910111213 Sep 13 '14 at 03:01
  • Your code is probably slow more due to the quadruple-nested "for" loops than anything else. :) – John Zwinck Sep 13 '14 at 03:07
  • @John Zwinck The above code runs much faster than a solution in a previous question of mine while trying to solve the same issue. http://stackoverflow.com/a/25422647/3761932 I will try to tackle the nesting but that's another question. – 12345678910111213 Sep 13 '14 at 03:13
  • You describe an algorithm that operates on the keys, but your code operates on the values. Is that just a typo? – Marcelo Cantos Sep 13 '14 at 03:17
  • @Marcelo Cantos The keys which are being examined as substrings are from the input dictionary called 'dictionary'. These keys are then inserted as values into another dictionary called len_dict. The purpose of len_dict is to put all keys together which have the same length. This way I will only search for keys which are substrings in other keys with a greater length. – 12345678910111213 Sep 13 '14 at 03:20
  • @mattkaeo: Indeed. I should read more carefully. – Marcelo Cantos Sep 13 '14 at 03:46

3 Answers


For the sample corpus, or any corpus in which most keys are small, it's much faster to test all possible subkeys:

def reduced(dictionary):
    keys = set(dictionary.iterkeys())
    subkeys = set()
    for key in keys:
        # generate every proper substring of key and record the ones
        # that are themselves keys in the dictionary
        for n in range(1, len(key)):
            for i in range(len(key) + 1 - n):
                subkey = key[i:i+n]
                if subkey in keys:
                    subkeys.add(subkey)

    return {k: v
            for (k, v) in dictionary.iteritems()
            if k not in subkeys}

This takes about 0.2s on my system (i7-3720QM 2.6GHz).
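As a quick sanity check on a tiny made-up dictionary (not one of the question's samples):

d = {'AT': 1, 'ATG': 2, 'ATGC': 3, 'GG': 4}
print reduced(d)  # {'ATGC': 3, 'GG': 4} -- 'AT' and 'ATG' occur inside 'ATGC'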

Marcelo Cantos
  • This is the fastest algorithm. I'm graphing it over 103,000 keys. I'll add this graph to the question. There appears to be a magic ratio for input size where a dict look-up == a string search. – 12345678910111213 Sep 14 '14 at 05:14

You create len_dict, but even though it groups keys of equal size, you still have to traverse everything multiple times to compare. Your basic plan is right - sort by size and only compare against what's the same size or bigger - but there are other ways to do that. Below, I just created a regular list sorted by key size and then iterated backwards so that I could trim the list as I went. I'm curious how its execution time compares to yours. It did your little dict example in 0.049 seconds.

(I hope it actually worked!)

def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))    # shortest keys first
    for i in range(len(items)-2, -1, -1):  # walk backwards so del never shifts unvisited entries
        k = items[i][0]
        for k_fwd, v_fwd in items[i+1:]:   # only keys at least as long as k
            if k in k_fwd:
                del items[i]
                break
    return dict(items)

EDIT

A significant speed increase by not unpacking k_fwd, v_fwd (after running both a few times, this wasn't really a speed-up; something else must have been eating time on my PC for a while).

def myfilter(d):
    items = d.items()
    items.sort(key=lambda x: len(x[0]))
    for i in range(len(items)-2,-1,-1):
        k = items[i][0]
        for kv_fwd in items[i+1:]:
            if k in kv_fwd[0]:
                del items[i]
                break
    return dict(items)
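If you want to time it with the timeit module (as suggested in the question's comments), something along these lines should work; sample_dict here is just a placeholder name for whichever of the posted sample dictionaries you load:

import timeit

# sample_dict is a placeholder: load one of the sample dictionaries into it first
print timeit.timeit(lambda: myfilter(sample_dict), number=5) / 5  # average seconds per call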
tdelaney
  • Your function is faster. For 14,805 keys, my function without the problematic line ran in 17.75 sec. Yours in 12.3139998913 sec. Your function returned 7087 keys & mine 7086, which is interesting. I like your approach. Do you have any insight why a dictionary lookup extended my function's runtime? – 12345678910111213 Sep 13 '14 at 03:57
  • @mattkaeo dictionary lookups aren't *free*, they're just (pseudo) constant time. You're doing millions of lookups due to how your code is structured. – roippi Sep 13 '14 at 04:04
  • 1. Your 2 for loops for len_dict.items() mean that you rebuild items in that second loop len(len_dict) times, and throw most of them away in the if right after. The remaining dict lookups are fairly expensive compared to an indexed lookup on a list (a rough timeit sketch of that difference is below these comments). In truth, I thought I'd beat you by a lot more than that. Now I'm puzzled... – tdelaney Sep 13 '14 at 04:05
  • @tdelaney That's interesting. I didn't know dictionary lookups could be more expensive than index lookups from a list. Here's the much larger 14805 key length dictionary I've been using for experimentation. http://justpaste.it/14805keys – 12345678910111213 Sep 13 '14 at 04:13
  • @mattkaeo - that took 12.029 seconds on my machine. i'm doing a bit of tweaking... – tdelaney Sep 13 '14 at 04:19
  • oops, i read the results from a dict i generated w/ 22056 items. your 14805 set was 4.00 seconds. Interesting the increased time with the larger set - ain't exponential growth grand! – tdelaney Sep 13 '14 at 04:26
  • I had a bug in range() - it stopped at 1 not 0 so I returned one too many items. Also, I stopped unpacking k_fwd,v_fwd and had a big improvement to 4.34 seconds. Editing the answer now. – tdelaney Sep 13 '14 at 04:34
  • @tdelaney Only 4 seconds? I'm curious why I'm not seeing this same result. It appears your first algorithm ran faster for me. I wanted to see how each of these algorithms behaved as the input size increased. I've updated the question with a graph for my results. – 12345678910111213 Sep 13 '14 at 07:13
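A rough timeit sketch of the list-indexing-versus-dict-lookup point from the comments above (with made-up data, just to isolate the two operations):

import timeit

setup = "d = dict.fromkeys(str(i) for i in range(15000)); lst = [str(i) for i in range(15000)]"
print timeit.timeit("'12345' in d", setup=setup, number=1000000)  # hash the string, then probe the dict
print timeit.timeit("lst[12345]", setup=setup, number=1000000)    # plain indexed access into a list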

I would do it a bit differently. Here's a generator function which gives you the "good" keys only. This avoids creating a dict which may be largely destroyed key-by-key. I also have just two levels of "for" loops and some simple optimizations to try to find matches more quickly and avoid searching for impossible matches.

def reduced_keys(dictionary):
    keys = dictionary.keys()
    keys.sort(key=len, reverse=True) # longest first for max hit chance
    for key1 in keys:
        found_in_key2 = False
        for key2 in keys:
            if len(key2) <= len(key1): # no more keys are long enough to match
                break
            if key1 in key2:
                found_in_key2 = True
                break
        if not found_in_key2:
            yield key1

If you want to make an actual dict using this, you can:

{ key: d[key] for key in reduced_keys(d) }
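For instance, on a small made-up dictionary:

d = {'AT': 1, 'ATG': 2, 'ATGC': 3, 'GG': 4}
print list(reduced_keys(d))                      # ['ATGC', 'GG']
print {key: d[key] for key in reduced_keys(d)}   # {'ATGC': 3, 'GG': 4}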
John Zwinck
  • For my 14,805 key dictionary justpaste.it/14805keys yours = 27.7s with the same 7086 keys as my function. Mine = 17.7s. I'll try to time this in a different way. What is the most expensive part of your generator? Is it the length comparison or is it populating the dictionary? – 12345678910111213 Sep 13 '14 at 04:56
  • Interesting. We could cache the length of key1 outside the inner loop. And this algorithm should be easy to port to C if you want much higher speed (yet still usable from Python). I'd be interested to know how fast it is if you sort in non-reverse order and take out the check for key lengths completely. – John Zwinck Sep 13 '14 at 10:54