
Sorry, another newbie query :| This builds on the suggestion given here: optimizing

I need to incrementally build a dictionary, i.e. one key: value pair at a time, inside a for loop. To be specific, the dictionary would look something like this (N keys, each value being a list of lists; each inner list has 3 elements):

dic_score = {key1: [[,,], [,,], [,,], ..., [,,]], key2: [[,,], [,,], [,,], ..., [,,]], ..., keyN: [[,,], [,,], [,,], ..., [,,]]}

This dict is generated by the following nested for loop:

dic_score = {}
for Gnodes in G.nodes():          # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():      # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = someoperation2(Gvalue, Hvalue)   # some pairwise scoring operation
        dic_score.setdefault(Gnodes, []).append([Hnodes, score, -1])

I then need to sort these lists, but the answer for that was already given here: optimizing (using a generator expression in place of the inner loop is an option).
[Note that the dict would contain 10000 keys, with each key associated with 10000 of these smaller lists.]

Since the loop counts are large, the dictionary generated is huge and I am running out of memory.

How can I write each key: value pair (a list of lists) to a file as soon as it is generated, so that I don't need to hold the entire dictionary in memory? I then want to be able to read the dictionary back in the same format, i.e. something like dic_score_after_reading[key] returns the list of lists I am looking for.

I am hoping that doing this writing and reading per key: value pair would considerably ease the memory requirements. Is there a better data structure for this? Should I be considering a database, perhaps something like Buzhug, which would give me the flexibility to access and iterate over the lists associated with each key?

I am currently using cPickle to dump the entire dictionary and then read it back via load(), but cPickle crashes when dumping such big data in one go.
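Roughly, the per-key dumping I have in mind looks like the sketch below (the file name and the someoperation2 placeholder are made up, just to illustrate the idea): dump one (key, value) pair per call, then read the pairs back one at a time.

try:
    import cPickle as pickle   # Python 2
except ImportError:
    import pickle              # Python 3

# Write one (key, value) pair per dump call instead of the whole dict at once
with open('dic_score.pkl', 'wb') as f:
    for Gnodes in G.nodes():
        Gvalue = someoperation(Gnodes)
        rows = []
        for Hnodes in H.nodes():
            Hvalue = someoperation(Hnodes)
            score = someoperation2(Gvalue, Hvalue)
            rows.append([Hnodes, score, -1])
        pickle.dump((Gnodes, rows), f, -1)   # only one key's data in memory here

# Read the pairs back one at a time
dic_score_after_reading = {}
with open('dic_score.pkl', 'rb') as f:
    while True:
        try:
            key, value = pickle.load(f)
        except EOFError:
            break
        dic_score_after_reading[key] = value

The writing part only ever holds one key's data in memory, but reading everything back into a single dict brings the full memory cost right back, which is why I am wondering about a better data structure or a database.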

Apologies, but I am unaware of the best practices to do this type of stuff. Thanks !

R.Bahl

1 Answer


You could look into using the ZODB in combination with the included BTrees implementation.

That gives you a mapping-like structure that writes individual entries separately to the object store. You'd need to use savepoints or plain transactions to flush data out to storage, but you can handle huge amounts of data this way.
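A rough sketch of how that could look is below. The file name, the 'scores' root key, the commit interval, and the someoperation placeholders carried over from the question are all just illustrative assumptions; and, as the comment thread below works out, values you mutate need to be PersistentList (or be reassigned) so the ZODB notices the change.

from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
from BTrees.OOBTree import OOBTree
from persistent.list import PersistentList
import transaction

storage = FileStorage('scores.fs')
db = DB(storage)
connection = db.open()
root = connection.root()

if 'scores' not in root:
    root['scores'] = OOBTree()   # mapping-like; entries are stored individually
scores = root['scores']

for i, Gnodes in enumerate(G.nodes()):
    Gvalue = someoperation(Gnodes)
    rows = PersistentList()      # a plain list would not be tracked by the ZODB
    for Hnodes in H.nodes():
        Hvalue = someoperation(Hnodes)
        score = someoperation2(Gvalue, Hvalue)
        rows.append([Hnodes, score, -1])
    scores[Gnodes] = rows
    if i % 100 == 0:
        transaction.commit()     # periodically flush finished keys out to disk
transaction.commit()
db.close()

When you reopen the database later, root['scores'][key] returns the list of lists for that key without pulling the whole mapping into memory.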

Martijn Pieters
  • Thanks for the quick response! Let me dig into this, as I am not aware of how these work. – R.Bahl Jun 27 '12 at 19:36
  • @Martijn: I tried to implement the ZODB in combination with a BTree, as you suggested, but I am running into a small issue. When I open the database for reading, I find that some of the records are missing. Specifically, as mentioned above, I store my dict in the form of a BTree and attach it to the root. The BTree contains, say, 20 keys, with each key corresponding to a list of lists (as written in the original question). However, some list elements go missing when I open the database again after having closed it once. I am using transaction.commit() every time I append to a key. – R.Bahl Jun 28 '12 at 04:28
  • In the inner loop, after every iteration, I do btree.setdefault(key, []).append([ref_node, score, -1]); transaction.commit(), so that the list associated with each key builds incrementally. I see the full list when I am done, but when I reopen the database I find missing elements in the lists of lists [refer to the original post]. Any ideas/suggestions as to what might be going wrong? Thanks in advance! – R.Bahl Jun 28 '12 at 04:29
  • Ah, you use lists; these are mutable and the ZODB doesn't notice the changes and won't commit those. See [ZODB not able to commit](http://stackoverflow.com/q/5704589), http://zodb.org/documentation/guide/prog-zodb.html#rules-for-writing-persistent-classes, http://zodb.org/documentation/articles/ZODB1.html#mutable-attributes, and http://zodb.org/documentation/guide/modules.html#PersistentList. – Martijn Pieters Jun 28 '12 at 08:30
  • Thanks for the response! I introduced `root[0]._p_changed = 1` before `transaction.commit()`, which worked when I had a small number of keys (20-30). As I jumped to more keys, >100, I am observing the same issue again: in some of the keys, some list elements are missing. Any other suggestions as to what's going on here? – R.Bahl Jun 28 '12 at 16:27
  • The BTree uses buckets, which are themselves persistent, when you get to more items. You'd have to set `_p_changed` on those, but that'll be tricky. Use persistent lists or reassignment (instead of .append) instead. – Martijn Pieters Jun 28 '12 at 16:34
  • Thanks a lot! Declaring the lists that I was storing in the BTree as PersistentList did it for me! I have a couple more newbie questions and would be glad if you could provide some insights on them. (1) I am currently committing only after both my loops have completely run and the entire BTree is ready. As I understand it, nothing will be written to the DB unless I do the commit. Does that mean that everything is still stored in RAM? In that case will I end up with the same problem of storing a huge dictionary in RAM? (2) How fast can the process be if I just write everything to a .txt file? – R.Bahl Jun 28 '12 at 20:58
  • You know, we are really straining the SO format here. You should really make those questions into new SO questions. That way others can help out too. :-) – Martijn Pieters Jun 28 '12 at 21:17
  • I agree :-) ! Thanks a lot for your guidance ! – R.Bahl Jun 28 '12 at 21:22