I'm learning Python. I've got a performance issue. For a single dictionary, I want to remove keys if
- The key is a substring of another key
I don't want to remove keys if
- The key substring is itself
My keys are unique strings, mostly 3-50 characters long. The dictionary I'm working with has 100,000 or more items, which means billions of comparisons. Since this is an O(n^2) problem, should I stop trying to optimize this code? Or is there room to make headway here?
A dictionary is preferable, but I'm open to other types.
For example: 'hello' contains 'he' and 'ell'. I want to remove the keys 'he' and 'ell' while keeping 'hello'. I'd like to remove prefixes, suffixes, and key-substrings in the middle of other keys.
The keys are generated one by one and added to a dictionary; then reduce_dict(dictionary) runs. My assumption is that testing each key as it's added would be as slow as testing them all afterwards in a function, as in the code below.
def reduce_dict(dictionary):
    reduced = dictionary.copy()
    for key in dictionary:
        for key2 in dictionary:
            if key != key2:
                if key2 in key:
                    reduced.pop(key2, 0)
    return reduced
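For concreteness, here is the behaviour I expect on a tiny sample. The values (1, 2, 3) are just placeholders; only the keys matter:

```python
def reduce_dict(dictionary):
    # Remove every key that appears as a substring of another, different key.
    reduced = dictionary.copy()
    for key in dictionary:
        for key2 in dictionary:
            if key != key2 and key2 in key:
                reduced.pop(key2, 0)
    return reduced

sample = {'hello': 1, 'he': 2, 'ell': 3}
print(reduce_dict(sample))  # {'hello': 1} -- 'he' and 'ell' are substrings of 'hello'
```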