
I'm using ZODB and ZEO, and storing a BTree in it with ~25 million objects whose keys are text strings of varying lengths. In order to iterate over the objects in a "safe and predictable way", I follow the advice of the BTrees documentation and first make a copy of the list of keys, like so (where `db` is the BTree object):

for key in list(db.keys()):
    ...do stuff...

However, the creation of that list of keys takes a relatively long time, even on a recent multicore server-grade system with 48 GB of RAM (running CentOS 7, bare metal, not in a VM). If I separately time how long it takes simply to do list(db.keys()), it takes 9-10 minutes. In terms of size, sys.getsizeof(list(db.keys())) reports 220 MB, which is consistent with the expected size of a list of 25 million strings whose lengths vary from 2-70 characters (approximately).

Is there a faster way to do the step of copying the keys to a list, or alternatively, is there a better approach to iterating over the BTree elements in a way that is safe in case other processes are adding objects to the BTree?

I have tried to research faster copy approaches in Python, but what I have found has been focused on copying lists or dictionaries (e.g., past SO questions here and here), not on "listifying" the keys from a BTree.

asked by mhucka
  • Are you actually mutating the dictionary (adding or removing keys)? If not, there is no need to create a copy of the keys. – Martijn Pieters Dec 28 '15 at 20:31
  • `sys.getsizeof()` only reports the size of the list, *not the strings contained in it*. The 220MB is for the pointers in the C structure, what those pointers reference (the strings) is not included in the total. You'd have to use `sys.getsizeof([None] * len(db)) + sum(sys.getsizeof(k) for k in db)`. – Martijn Pieters Dec 28 '15 at 20:32
  • Last but not least, a good part of the slowness comes from you loading 25 million objects from the ZODB. You are I/O bound here. Try to avoid having to process all keys in one step. Use the `min` or `max` keywords to the `.keys()` method to process keys in batches perhaps, to break up the processing into manageable chunks. – Martijn Pieters Dec 28 '15 at 20:40
  • @MartijnPieters (1) Keys+values are indeed being added. But, if it makes a difference to potential efficiency tweaks, no keys will be deleted. (2) Darn, I thought that `sys.getsizeof()` would report the strings contained in the list. Thank you for the tip on how to calculate the size better. (3) Thank you for the suggestion to break things up into batches. – mhucka Dec 28 '15 at 20:46
  • Then perhaps consider adding those keys and values to a separate structure, and merge the trees after you are done. You don't have to create a copy of the current keys that way. – Martijn Pieters Dec 28 '15 at 20:47
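A minimal sketch of the deeper size estimate suggested in the comments: `sys.getsizeof` on a list counts only the list object itself (header plus one pointer per element), not the string objects the pointers reference, so the 220 MB figure understates the real footprint.

```python
import sys

def deep_list_size(keys):
    # Size of the list object itself plus the size of each string
    # object it references. Assumes the keys are unique strings, as
    # in the BTree described above; shared/interned objects would be
    # double-counted by this naive sum.
    return sys.getsizeof(keys) + sum(sys.getsizeof(k) for k in keys)

keys = ["alpha", "beta", "gamma"]
print(sys.getsizeof(keys))   # container only
print(deep_list_size(keys))  # container plus the strings
```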
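The batching idea from the comments can be sketched as follows, assuming the BTrees range-search API where `tree.keys(min=..., excludemin=True)` returns the keys strictly greater than `min`, in order. Restarting from the last key seen makes each batch a fresh, short-lived range query, so no full 25-million-element list is ever materialized and concurrent writers adding keys do not invalidate a long-lived iterator. The `FakeTree` stand-in is hypothetical, included only so the sketch runs without ZODB/BTrees installed; with ZODB you would pass the real `OOBTree` instead.

```python
from itertools import islice
from bisect import bisect_left, bisect_right

def iterate_in_batches(tree, batch_size=100_000):
    # Yield keys in order, re-querying the tree between batches via
    # the (assumed) keys(min=..., excludemin=...) range-search API.
    last = None
    while True:
        view = tree.keys() if last is None else tree.keys(min=last, excludemin=True)
        batch = list(islice(iter(view), batch_size))
        if not batch:
            return
        yield from batch
        last = batch[-1]

class FakeTree:
    # Hypothetical stand-in mimicking the keys(min=..., excludemin=...)
    # signature of an OOBTree, backed by a sorted list.
    def __init__(self, keys):
        self._keys = sorted(keys)
    def keys(self, min=None, excludemin=False):
        if min is None:
            return self._keys
        i = bisect_right(self._keys, min) if excludemin else bisect_left(self._keys, min)
        return self._keys[i:]
```

Note that keys inserted by another process *behind* the current position are skipped, which matches the stated workload (keys are added but never deleted); each batch always resumes strictly after the last key already processed.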

0 Answers