TL;DR
What is the most efficient way to implement a filter function for a dictionary with keys of variable dimensions? The filter should take a tuple of the same dimensions as the dictionary's keys and output all keys in the dictionary which match the filter such that filter[i] is None or filter[i] == key[i]
for all dimensions i
.
In my current project, I need to handle dictionaries with a lot of data. The general structure of the dictionary is such that it contains tuples with 2 to 4 integers as keys and integers as values. All keys in a dictionary have the same dimensions. To illustrate, the following are examples of dictionaries I need to handle:
{(1, 2): 1, (1, 5): 2}
{(1, 5, 3): 2}
{(5, 2, 5, 2): 8}
These dictionaries contain a lot of entries, with the largest ones at about 20 000 entries. I frequently need to filter these entries, but often only looking at certain indices of the key tuples. Ideally, I want to have a function to which I can supply a filter tuple. The function should then return all keys which match the filter tuple. If the filter tuple contains a None
entry, then this will match any value in the dictionary's key tuple at this index.
Example of what the function should do for a dictionary with 2-dimensional keys:
>>> dict = {(1, 2): 1, (1, 5): 2, (2, 5): 1, (3, 9): 5}
>>> my_filter_fn((1, None))
{(1, 2), (1, 5)}
>>> my_filter_fn((None, 5))
{(1, 5), (2, 5)}
>>> my_filter_fn((2, 4))
set()
>>> my_filter_fn((None, None))
{(1, 2), (1, 5), (2, 5), (3, 9)}
As my dictionaries have different dimensions of their tuples, I have tried solving this problem by writing a generator expression which takes the dimensions of the tuple into account:
def my_filter_fn(entries: dict, match: tuple):
return (x for x in entries.keys() if all(match[i] is None or match[i] == x[i]
for i in range(len(key))))
Unfortunately, this is quite slow compared to writing out condition completely by hand ((match[0] is None or match[0] === x[0]) and (match[1] is None or match[1] == x[1]
); for 4 dimensions this is about 10 times slower. This is a problem for me as I need to do this filtering quite often.
Following code demonstrates the performance issue. Code is just supplied to illustrate the problem and enable reproduction of the tests. You can skip the code part, results are below.
import random
import timeit
def access_variable_length():
for key in entry_keys:
for k in (x for x in all_entries.keys() if all(key[i] is None or key[i] == x[i]
for i in range(len(key)))):
pass
def access_static_length():
for key in entry_keys:
for k in (x for x in all_entries.keys() if
(key[0] is None or x[0] == key[0])
and (key[1] is None or x[1] == key[1])
and (key[2] is None or x[2] == key[2])
and (key[3] is None or x[3] == key[3])):
pass
def get_rand_or_none(start, stop):
number = random.randint(start-1, stop)
if number == start-1:
number = None
return number
entry_keys = set()
for h in range(100):
entry_keys.add((get_rand_or_none(1, 200), get_rand_or_none(1, 10), get_rand_or_none(1, 4), get_rand_or_none(1, 7)))
all_entries = dict()
for l in range(13000):
all_entries[(random.randint(1, 200), random.randint(1, 10), random.randint(1, 4), random.randint(1, 7))] = 1
variable_time = timeit.timeit("access_variable_length()", "from __main__ import access_variable_length", number=10)
static_time = timeit.timeit("access_static_length()", "from __main__ import access_static_length", number=10)
print("variable length time: {}".format(variable_time))
print("static length time: {}".format(static_time))
Results:
variable length time: 9.625867042849316 static length time: 1.043319165662158
I would like to avoid having to create three different functions my_filter_fn2
, my_filter_fn3
, and my_filter_fn4
to cover all possible dimensions of my dictionaries and then use static dimensions filtering. I am aware that filtering for variable dimensions will always be slower than filtering for fixed dimensions, but was hoping that it would not be almost 10 times slower. As I am not a Python expert, I was hoping that there is a clever way in which my variable dimensions generator expression could be reformulated to give me better performance.
What is the most efficient way to filter a huge dictionary in the way I described?