
I am working with a very large number of values in Python (the memory footprint is about 5 GB).

Sometimes I need to access values by key, and sometimes I need to loop over the values. For performance reasons, I also build a List from the Dict at startup, so I can:

  • use the Dict in cases where I want to access values by key
  • use the List in cases where I want to loop values
my_big_values_list = list(my_big_dict_of_values.values())

Here's a performance comparison, just for clarity:

>some_dict = dict(zip(range(1000000), reversed(range(1000000))))
>some_list = list(some_dict.values())
>%timeit for t in some_dict.values(): t 
21.1 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>%timeit for t in some_list: t 
16.1 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

My problem arises when I need to delete keys from the dict based on user input. First, I delete entries from the dict with:

for key in keys_to_remove:
    del my_big_dict_of_values[key]

After this operation, I also want to update my_big_values_list. I can do this with:

Solution A (Slow)

indexes_to_remove = list()
for idx, value in enumerate(my_big_values_list):
    if value.key in keys_to_remove:
        indexes_to_remove.append(idx)
for index in sorted(indexes_to_remove, reverse=True):
    del my_big_values_list[index]

However, this is really slow and cumbersome.
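As a side note, Solution A can be sped up considerably without changing its approach: a set makes each membership test O(1), and a single filtering pass avoids the repeated O(n) list deletions. A minimal sketch, using a hypothetical `SimpleNamespace` value that carries a `key` attribute as Solution A assumes:

```python
from types import SimpleNamespace

# Hypothetical setup: values that carry their own key, as Solution A assumes.
my_big_values_list = [SimpleNamespace(key=k, data=k * 10) for k in range(6)]
keys_to_remove = {1, 4}  # a set makes each membership test O(1)

# Single filtering pass instead of collecting indexes and deleting them
# one by one (each del on a list is O(n), so Solution A is O(n*m) overall).
my_big_values_list = [v for v in my_big_values_list
                      if v.key not in keys_to_remove]

print([v.key for v in my_big_values_list])  # [0, 2, 3, 5]
```

Note that, like Solution B below, this rebinds the name to a new list rather than mutating the existing one, so it has the same reference problem.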

Ideally, I would like to just create the list from the dict again with:

Solution B (Fast with reference issue)

my_big_values_list = list(my_big_dict_of_values.values())

This is fast, but it creates a new list object. Every other class/function holding a reference to the old my_big_values_list would then have to be updated, which seems wrong. To illustrate:

class handle_process_class:
    def __init__(self, my_big_dict_of_values, my_big_values_list):
        self.my_big_dict_of_values = my_big_dict_of_values
        self.my_big_values_list = my_big_values_list

    def calculate(self):
        return len(self.my_big_values_list)


class userinput_class:
    def __init__(self, handle_process):
        self.handle_process = handle_process

    def user_del_key(self, key):
        del self.handle_process.my_big_dict_of_values[key]
        # Update the list here too:
        # Solution A works.
        # Solution B breaks handle_process.calculate(),
        # because handle_process still holds the old list.


my_big_values_list = list(my_big_dict_of_values.values())

handle_process = handle_process_class(
    my_big_dict_of_values, my_big_values_list)

userinput = userinput_class(handle_process)

handle_process.calculate()

Is there a way to modify my_big_values_list in place, so that all existing references see the new contents, rather than rebinding the name to a new list (e.g. list(my_big_dict_of_values.values()))?

I have read how Python passes references to values, and I think I understand most of it. That's why I came up with solution A, but I don't know how to use Solution B to modify the referenced list. Perhaps someone can explain what is going on here?

Alex
  • Are you __sure__ you need this list at all ??? Looping over values is as simple as `for val in yourdict.values():` or - if you're using Python 2.7 and want to save RAM - `for val in yourdict.itervalues():` – bruno desthuilliers Jan 22 '19 at 10:53
  • I would agree in most cases, but iterating over lists [is faster](https://stackoverflow.com/questions/12543837/python-iterating-over-list-vs-over-dict-items-efficiency) than iterating over dict.values(), and this is noticeable at the volume of values in my dict() – Alex Jan 22 '19 at 11:14
  • Note: I edited my original answer and added a performance comparison – Alex Jan 22 '19 at 11:47
  • It might be faster in itself, but maintaining the list adds some overhead too, so depending on your concrete use case the net gain might _or not_ be that important (can't tell, really). Also, this might not be the main bottleneck in your code either, so if you want to properly optimize your code, you really want to profile it first. (nb: not to say this current optimisation is useless or doesn't make sense - just that we humans are notably bad at guessing where the true bottlenecks are in any non-trivial code). – bruno desthuilliers Jan 22 '19 at 11:50
  • Totally agree: I might need to properly performance test my code. A bit more background: the my_big_dict_of_values is the core of my program, and I am literally running millions of iterations in all parts of my code on it. – Alex Jan 22 '19 at 11:52

1 Answer


To modify the list in-place, assign to its slice:

my_big_values_list[:] = list(my_big_dict_of_values.values())

Example:

>>> my_big_dict_of_values = {"a": 1, "b": 2, "c": 3}
>>> my_big_values_list = list(my_big_dict_of_values.values())
>>> another_list_reference = my_big_values_list

>>> print(my_big_values_list, another_list_reference)
[1, 2, 3] [1, 2, 3]

>>> del my_big_dict_of_values["b"]
>>> my_big_values_list[:] = list(my_big_dict_of_values.values())

>>> print(my_big_values_list, another_list_reference)
[1, 3] [1, 3]

However, in terms of performance and memory usage, consider whether a separate huge list is really necessary, since you can loop directly over my_big_dict_of_values.values().
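A small additional point worth knowing: slice assignment accepts any iterable, so the intermediate list() copy can be skipped entirely. A minimal sketch:

```python
my_big_dict_of_values = {"a": 1, "b": 2, "c": 3}
my_big_values_list = list(my_big_dict_of_values.values())
alias = my_big_values_list  # a second reference to the same list object

del my_big_dict_of_values["b"]

# Slice assignment consumes any iterable, so passing the view directly
# avoids building a temporary list first:
my_big_values_list[:] = my_big_dict_of_values.values()

print(my_big_values_list, alias)  # [1, 3] [1, 3]
print(my_big_values_list is alias)  # True: still the same object
```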

mportes
  • Wow, so simple and well explained! Many thanks, this definitely helps my whole Python knowledge. – Alex Jan 22 '19 at 11:23