
I have a dict with 20,000 keys and a total size of 150 MB. I dump the dict to disk via pickle every hour and load the pickle file on program startup. Here is the gist of the writing code:

cache_copy = copy.deepcopy(self.cache)  # self.cache is the dict
pickle.dump(cache_copy, cache_file, pickle.HIGHEST_PROTOCOL)
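
The loading code on startup is essentially the reverse (a sketch; cache_path stands in for the real path):

with open(cache_path, "rb") as cache_file:
    self.cache = pickle.load(cache_file)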

Sometimes I get the following error:

    cache_copy = copy.deepcopy(self.cache)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 256, in _deepcopy_dict
    for key, value in x.iteritems():
RuntimeError: dictionary changed size during iteration
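
For completeness, the race is easy to reproduce in isolation. A toy sketch along these lines (illustrative names, not my real code) usually triggers the same error, because deepcopy iterates over the dict while another thread is still adding keys:

import copy
import threading

cache = {i: i for i in range(100000)}
stop = False

def writer():
    i = len(cache)
    while not stop:
        cache[i] = i   # only additions, matching the constraints below
        i += 1

t = threading.Thread(target=writer)
t.start()
try:
    for _ in range(100):
        copy.deepcopy(cache)
    print("no error this run (timing dependent)")
except RuntimeError as e:
    print(e)   # dictionary changed size during iteration
finally:
    stop = True
    t.join()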

What is the best way to do this? I really want to avoid thread locking, as it makes the code complex. If it is really necessary, locking should be as minimal and simple as possible and allow for some concurrency. There are several constraints in my code that could help in this direction:

  • Multiple threads read and write to the dict(). However, writes only ever add (key, value) pairs; pairs are never deleted or modified.
  • I am open to changing the data structure from dict() to something else, as long as it supports fast in-memory lookups and writes.
  • I don't mind stale writes: if the dict() has had some recent appends and we write out a snapshot that is a few seconds old, that is OK.

1 Answer


TL;DR edit

To avoid locking the dict during pickling, create a list that duplicates the dict:

# When you do this:
cache["new_key"] = "new_value"
# Also do this:
cache_list.append(("new_key", "new_value"))

And pickle the list instead.

Preferably, append to cache_file instead of overwriting it. That way, you can clear cache_list after each write to the file, which avoids wasting memory and redundant disk writes. There can still be a bug: a thread may append to the list right after the pickling, and that value then gets cleared. Maybe you're OK with losing some values if this is just a cache; if not, use some locks or just don't clear the list. I was incorrect about double memory usage: the list doesn't deepcopy the data, it only stores 20,000 tuples, each holding two references.
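
A rough sketch of how this bookkeeping could fit together (put and save_increment are made-up names, and the error handling a real cache would need is omitted):

import pickle

cache = {}        # the live dict, read and written by many threads
cache_list = []   # (key, value) pairs added since the last dump

def put(key, value):
    # list.append is effectively atomic in CPython, so the normal
    # write path needs no lock.
    cache[key] = value
    cache_list.append((key, value))

def save_increment(cache_path):
    # Append the new pairs to the file as one more pickle record.
    with open(cache_path, "ab") as cache_file:
        pickle.dump(cache_list, cache_file, pickle.HIGHEST_PROTOCOL)
    # Caveat from above: pairs appended between the dump and this clear
    # are lost. Don't clear (and open with "wb" instead) if that matters.
    del cache_list[:]   # list.clear() doesn't exist on Python 2.7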

Original answer

You need to lock all writes if you want to iterate, copy or pickle your dict. However, if you really don't want to lock the threads that use the dict and you don't mind doubling your memory usage, I can suggest keeping a list of key-value pairs (which duplicates your dict), a lock for that list and a queue of writes to that list:

# When you do this:
cache["new_key"] = "new_value"

# Also do this:
if not list_is_locked:
    cache_list.append(("new_key", "new_value"))
else:
    # Write here immediately instead of waiting for list to unlock
    queue.append(("new_key", "new_value"))

And when the pickling time comes, you lock and pickle that list instead of the dict:

list_is_locked = True
# pickle cache_list here
list_is_locked = False
cache_list.extend(queue)
queue.clear()
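
As a sketch only, the same idea with an actual threading.Lock instead of the boolean flag (record and dump are made-up names):

import pickle
import threading

cache_list = []
pending = []                 # buffered writes while the list is being pickled
list_lock = threading.Lock()

def record(key, value):
    # Non-blocking attempt; if the pickler holds the lock, buffer the write.
    if list_lock.acquire(False):
        try:
            cache_list.append((key, value))
        finally:
            list_lock.release()
    else:
        pending.append((key, value))

def dump(cache_path):
    with list_lock:
        with open(cache_path, "wb") as cache_file:
            pickle.dump(cache_list, cache_file, pickle.HIGHEST_PROTOCOL)
    # Move the buffered writes back once the lock is released.
    cache_list.extend(pending)
    del pending[:]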

I'm not really familiar with pickle, but if it's possible to append to cache_file instead of overwriting it, you could also clear the list after pickling it. That would help to avoid that huge memory burden.
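
For reference, this is possible: each pickle.dump() call writes a self-contained record, and repeated pickle.load() calls on the same file object read the records back in order until EOFError. A toy demonstration (file name made up):

import pickle

# Two separate dumps appended to the same file...
with open("cache_file.pkl", "ab") as f:
    pickle.dump([("a", 1)], f, pickle.HIGHEST_PROTOCOL)
    pickle.dump([("b", 2)], f, pickle.HIGHEST_PROTOCOL)

# ...can be read back one record at a time on startup.
restored = {}
with open("cache_file.pkl", "rb") as f:
    while True:
        try:
            restored.update(pickle.load(f))   # each load returns one list of pairs
        except EOFError:
            break
print(restored)   # {'a': 1, 'b': 2}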

  • Very elegant answer. Can the list lock also be removed (https://stackoverflow.com/questions/6319207/are-lists-thread-safe), as we are OK with stale writes? – dark knight Jan 17 '22 at 18:15
  • The lock was intended to protect the iterator during pickling. I was expecting it to throw the same `"changed size during iteration"` error as the dict iterator. But I've just tried to invalidate a list iterator, and it doesn't throw. So yes, you need neither a lock nor a queue, only a list. I'll edit my answer. – Expurple Jan 17 '22 at 20:46