5

I'm learning Python. I've got a performance issue. For a single dictionary, I want to remove keys if

  • The key is a substring of another key

I don't want to remove keys if

  • The key is only a substring of itself

My keys are unique strings mostly between 3-50 characters in length. The dictionary I'm working with has 100,000 or more items, making billions of comparisons. Since this is an O(n^2) problem, should I stop trying to optimize this code? Or is there room to make headway here?

A dictionary is preferable, but I'm open to other types.

For example: 'hello' contains 'he' and 'ell'. I want to remove the keys 'he' and 'ell' while keeping 'hello'. I'd like to remove prefixes, suffixes, and key-substrings in the middle of other keys.

The keys are generated one by one and added to a dictionary; then reduce_dict(dictionary) runs. My assumption is that testing each key as it's added would be as slow as testing them all afterwards with a function like the one below.

def reduce_dict(dictionary):
    reduced = dictionary.copy()
    for key in dictionary:
        for key2 in dictionary:
            if key != key2 and key2 in key:   # key2 is a proper substring of key
                reduced.pop(key2, 0)          # drop it (no-op if already gone)
    return reduced
  • Does your dictionary have the same key multiple times? Your question seems unclear. – salmanwahed Aug 21 '14 at 08:12
  • what are your keys? strings? tuples? – timgeb Aug 21 '14 at 08:12
  • The keys are unique strings. Usually short between 3-50 characters. I'll update the question. – 12345678910111213 Aug 21 '14 at 08:13
  • It will be clearer if you use the term "substring", because it seems that's what you mean. – Daniel Hershcovich Aug 21 '14 at 08:19
  • OK, now I understand better... I deleted my answer since it does not address the real problem. I'll stay around since I would like to find a good answer for this. – bgusach Aug 21 '14 at 08:21
  • One easy gain would be to make the inner loop a dictionary comprehension that builds the reduced dict – John Greenall Aug 21 '14 at 08:21
  • Do you want to check for substring or prefix? That is do you want to keep the key `ab` even if you have the key `cab`? Or only remove it if you have keys that start with `ab`? – Seb D. Aug 21 '14 at 08:53
  • I want a substring so that the key will be deleted even if it's in the middle of another. I'll update the question. – 12345678910111213 Aug 21 '14 at 08:56
  • How are these keys generated? Is it possible to (without extra computation) create a separate dictionary for keys which are substrings? – dshepherd Aug 21 '14 at 09:16
  • The keys are generated one by one and added to a dictionary. Then my reduce_dict(dictionary) function runs. I have no way of knowing ahead of time if the key I'm adding is a substring. That's what this function tries to figure out. – 12345678910111213 Aug 21 '14 at 09:19

5 Answers

2

Given that your strings are fairly short, you could store a hash set of all possible substrings for each key. That would let you find, for a given substring, all keys with a matching substring in O(n) time; the tradeoff is that you'd increase the time complexity of your insertions, since you'd be constructing a set of substrings for each new key.
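
A minimal sketch of that idea (my code, not the answerer's; reduce_dict is the question's function name): collect every proper substring of every key into one shared set, then keep only the keys that never appear in it.

def reduce_dict(dictionary):
    # Collect every proper substring of every key. With keys of ~3-50
    # characters, each key contributes at most ~1275 substrings.
    substrings = set()
    for key in dictionary:
        for i in range(len(key)):
            for j in range(i + 1, len(key) + 1):
                if j - i < len(key):        # proper substrings only
                    substrings.add(key[i:j])
    # A key survives only if it never occurs inside some other key.
    return {k: v for k, v in dictionary.items() if k not in substrings}

This trades the O(n^2) pairwise comparisons for O(n * L^2) set insertions, where L is the maximum key length.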

Sean Azlin
  • I think this is the fastest of the answers here. Given that your strings are limited to ~50 characters, they have at most [\sum_{n=1}^{50} (51 - n) = 1275](http://math.stackexchange.com/questions/113057/how-many-substrings-of-length-m-can-be-formed-from-a-string-of-length-n) substrings each. So for n_strings >> 1275 this method wins. – dshepherd Aug 21 '14 at 09:55
2

I think you can create a list of "good" keys (i.e. those that are not substrings of others) in a slightly optimized way:

# keys = yourDict.keys(), e.g.
keys = ['low', 'el', 'helloworld', 'something', 'ellow', 'thing', 'blah', 'thingy']

# flt is [[key, is_substring],...] sorted by key length reversed
flt = [[x, 0] for x in sorted(keys, key=len, reverse=True)]

for i in range(len(flt)):
    p = flt[i]
    if p[1]:  # already removed
        continue
    for j in range(i + 1, len(flt)): # iterate over shorter strings
        q = flt[j]
        if not q[1] and q[0] in p[0]: # if not already removed and is substring
            q[1] = 1  # remove

goodkeys = set(x[0] for x in flt if not x[1])
print(goodkeys)  # e.g. {'helloworld', 'something', 'thingy', 'blah'}

Now the removal is trivial:

newdict = {k:olddict[k] for k in goodkeys}
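
For instance, attaching hypothetical values to the sample keys above:

olddict = {k: len(k) for k in keys}   # made-up values for the demo
newdict = {k: olddict[k] for k in goodkeys}
# newdict == {'helloworld': 10, 'something': 9, 'thingy': 6, 'blah': 4}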
georg
  • Just curious: is there an efficiency reason to use `for i in range(len(flt)):` instead of `for i, p in enumerate(flt):`? (and similarly for the second for loop with `for j, q in enumerate(flt[i+1:])`) – dshepherd Aug 21 '14 at 09:26
  • @dshepherd: `flt[i+1:]` will create a copy each time. – georg Aug 21 '14 at 09:26
  • I thought [slices didn't copy in python](http://stackoverflow.com/questions/5131538/slicing-a-list-in-python-without-generating-a-copy). But even if they do: same question for an iterator version. – dshepherd Aug 21 '14 at 09:34
  • @dshepherd: slices don't copy list _items_, but they do copy _lists_. E.g. given `x=[a,b,c]` where a, b, c are objects, `x[1:]` creates a new list `[b,c]`. No objects are copied, but the memory is still allocated for a new list structure - this is what I'm trying to avoid. – georg Aug 21 '14 at 09:38
  • This method was fast at the beginning of the list and kept getting faster towards the end, where the per-key time became negligible. – 12345678910111213 Aug 22 '14 at 01:57
1

If instead of `key2 in key` (that is, "key2 is a substring of key") you relax your requirement to "key2 is a prefix of key", you can use a trie for efficient prefix checks. See this answer.

First define make_trie as in the answer above:

_end = '_end_'

def make_trie(*words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict = current_dict.setdefault(_end, _end)
    return root

Then define a function like in_trie from the answer above, but checking whether a key is a strict prefix of another key:

def is_strict_prefix_of_word_in_trie(trie, word):
    current_dict = trie
    for letter in word:
        if letter in current_dict:
            current_dict = current_dict[letter]
        else:
            return False  # word diverges from every word in the trie
    # word is a strict prefix of another word iff this node has at least
    # one child besides the end marker (the word itself may also be in
    # the trie, so checking for _end alone is not enough)
    return any(k != _end for k in current_dict)

Finally, perform your deletions like so:

def reduce_dict(dictionary):
    trie = make_trie(*dictionary.keys())  # note the unpacking: make_trie takes *words
    reduced = dictionary.copy()
    for key in dictionary:
        if is_strict_prefix_of_word_in_trie(trie, key):
            reduced.pop(key, 0)
    return reduced

Or you can use a dictionary comprehension:

def reduce_dict(dictionary):
    trie = make_trie(*dictionary.keys())
    return {key: value for (key, value) in dictionary.items()
            if not is_strict_prefix_of_word_in_trie(trie, key)}
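
A quick sanity check of the prefix-only behaviour, on sample data of my own (note that 'ell' survives: it is a substring of 'hello', but not a prefix):

d = {'hello': 1, 'he': 2, 'hell': 3, 'ell': 4, 'world': 5}
print(reduce_dict(d))  # {'hello': 1, 'ell': 4, 'world': 5}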
Daniel Hershcovich
1

If the dictionary is static, IMHO there is no point optimising the operation: it will run only once, and in less time than you would need to carefully optimise and test the optimisations.

If the dictionary is dynamic, you could store a timestamp in the value (if that makes sense for your data) or keep a list of keys that have already been cleaned. Then, when you run your cleaning process again, you have two sets of keys: the ones that were already processed (size n1), and the new keys (size n2). You only compare:

  • a new key may be a substring of an old key
  • an old key may be a substring of a new key
  • a new key may be a substring of another new key

So you have n2 * (n2 + 2 * n1) comparisons. If n1 >> n2, that is O(n1 * n2).

Alternatively, if adding an element to the dictionary is not a time-bounded (or interactive) operation, you can test at each addition in O(2n), with nothing else to maintain (no key list, no timestamps).

Indeed, if you clean up your dictionary once with the trivial O(n^2) algorithm, and then check the keys whenever a new element is generated, you can safely assume that none of the existing keys is a substring of another. You just have to test (see the sketch after this list):

  • is the new key a substring of an existing key? - n operations in the worst case (but probably the most common one)
  • is an existing key a substring of the new key? - n operations in all cases

The only requirement is that you never try to add a key before the cleaning pass for the previous one has finished. That is trivially true if only one thread in one process accesses the dictionary; otherwise you will need synchronization.
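
A minimal sketch of that per-addition check (an add_key helper of my own devising, assuming the dictionary has already been cleaned once):

def add_key(dictionary, key, value):
    # Assumed invariant: no existing key is a substring of another.
    # Pass 1 (up to n comparisons): if the new key is a substring of an
    # existing key, it must not be added at all.
    for existing in dictionary:
        if key != existing and key in existing:
            return
    # Pass 2 (n comparisons): existing keys that are substrings of the
    # new key must be dropped before inserting it.
    doomed = [k for k in dictionary if k != key and k in key]
    for k in doomed:
        del dictionary[k]
    dictionary[key] = value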

Serge Ballesta
  • Could you clarify how testing for the addition of an element individually could be O(2n)? I do have the ability to perform a test as each element is generated. – 12345678910111213 Aug 21 '14 at 09:15
  • It's the assumption that `n >> n2` that helps, it won't be O(2n) if you do it for the entire set. – dshepherd Aug 21 '14 at 09:29
1

Since the keys are strings, you can use the find method to detect substrings and delete those keys.

If d is a dictionary,

d = {'hello': 1, 'he': 2, 'llo': 3, 'world': 4, 'wor': 5, 'ld': 6, 'python': 2.7}

for key in list(d):           # iterate over snapshots so deleting is safe
    for sub in list(d):
        if key.find(sub) >= 0 and key != sub:
            d.pop(sub, None)  # sub is a proper substring of key; remove it

d will then be:

{'python': 2.7, 'world': 4, 'hello': 1}
salmanwahed