Following the suggestions given here, I have stored my data in ZODB, building it with the following piece of code:
# Structure of the data (around 3.5 GB on disk): ~10,000 keys, each mapping to
# a list of ~10,000 [node, score, 0] triples
# btree_container = {key1: [[2, .44, 0], [1, .23, 0], [4, .21, 0], ... 10,000th element],
#                    key2: [[3, .77, 0], [1, .22, 0], [6, .98, 0], ... 10,000th element],
#                    ...
#                    10,000th key: [[5, .66, 0], [2, .32, 0], [8, .66, 0], ... 10,000th element]}
# Code used to build the above-mentioned data set
import transaction
from persistent.list import PersistentList

for Gnodes in G.nodes():                      # Gnodes iterates over 10,000 values
    Gvalue = someoperation(Gnodes)
    for i, Hnodes in enumerate(H.nodes()):    # Hnodes iterates over 10,000 values
        Hvalue = someoperation(Hnodes)
        score = someoperation(Gvalue, Hvalue)
        # build a list corresponding to every value of Gnodes (the key)
        btree_container.setdefault(Gnodes, PersistentList()).append([Hnodes, score, 0])
        if i % 5000 == 0:
            transaction.savepoint(True)       # move pending changes out of memory into a temporary file
transaction.commit()                          # flush all the data to disk
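For completeness, the snippet above assumes that a ZODB connection was opened beforehand and that the container was attached to the root under key 0, since the second module below reads it back from root[0]. A minimal sketch of that setup, with the container type being my assumption:

from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
from BTrees.OOBTree import OOBTree

storage = FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
btree_container = OOBTree()   # assumed: an OOBTree whose values are PersistentLists
root[0] = btree_container     # the second module reads the data back from root[0]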
Now, in a separate module, I want to (1) modify the stored data and (2) sort it. The following is the code I was using:
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction

storage = FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
sim_sorted = root[0]

# Substitute the last element of every list of every key (the 0 above) with 1.
# This code exhausts all the memory; it never gets to the 2nd part, i.e. the sorting.
for x in sim_sorted.iterkeys():
    for i, y in enumerate(sim_sorted[x]):
        y[2] = 1                  # the last element sits at index 2, not 3
        if i % 5000 == 0:
            transaction.savepoint()

# Sort all the lists associated with every key in reverse order, using the middle element (the score) as the key
for keys in sim_sorted.iterkeys():
    sim_sorted[keys].sort(key=lambda x: -x[1])
However, the code used for editing the values is eating up all the memory (it never gets to the sorting). I am not sure how this works internally, but I have a feeling that something is terribly wrong with my code and that ZODB is pulling everything into memory, hence the issue. What would be the correct way to achieve the desired effect, i.e. the substitution and sorting of the stored elements in ZODB, without running into memory issues? The code is also very slow; any suggestions to speed it up?
[Note: It's not necessary for me to write these changes back to the database]
EDIT
There seems to be a slight improvement in memory usage after adding connection.cacheMinimize() after the inner loop; however, after some time the entire RAM is consumed again, which leaves me puzzled.
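For reference, this is roughly where the call now sits (the rest of the loop is unchanged from the code above):

for x in sim_sorted.iterkeys():
    for i, y in enumerate(sim_sorted[x]):
        y[2] = 1
        if i % 5000 == 0:
            transaction.savepoint()
    connection.cacheMinimize()   # deactivate unmodified objects in the connection's cache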