
Following the suggestions given here, I have stored my data in a ZODB database, built by the following piece of code:

# structure of the data [around 3.5 GB on disk]
btree_container = {key1: [[2, .44, 0], [1, .23, 0], [4, .21, 0], ... [10,000th element]],
                   key2: [[3, .77, 0], [1, .22, 0], [6, .98, 0], ... [10,000th element]],
                   ...
                   10,000th key: [[5, .66, 0], [2, .32, 0], [8, .66, 0], ... [10,000th element]]}

# Code used to build the above-mentioned data set
import transaction
from persistent.list import PersistentList

for Gnodes in G.nodes():                    # Gnodes iterates over 10,000 values
    Gvalue = someoperation(Gnodes)
    for i, Hnodes in enumerate(H.nodes()):  # Hnodes iterates over 10,000 values
        Hvalue = someoperation(Hnodes)
        score = someoperation(Gvalue, Hvalue)
        # build a list corresponding to every value of Gnodes (the key)
        btree_container.setdefault(Gnodes, PersistentList()).append([Hnodes, score, 0])
        if i % 5000 == 0:                   # flush the changes so far to temporary storage
            transaction.savepoint(True)
transaction.commit()                        # flush all the data to disk
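
For reference, the container is attached to the database root before the commit, roughly like this (the setup is assumed here, since the reading code below expects the data at root[0]):

from ZODB.FileStorage import FileStorage
from ZODB import DB
import transaction

storage = FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()

root[0] = btree_container   # the container built by the loop above
transaction.commit()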

Now, in a separate module, I want to (1) modify the stored data and (2) sort it. Here is the code I was using:

from ZODB.FileStorage import FileStorage
from ZODB import DB
import transaction

storage = FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
sim_sorted = root[0]

# Substitute the last element in every list of every key (the 0 above) with 1.
# This code exhausts all the memory; I never get to the 2nd part, i.e. the sorting.
for x in sim_sorted.iterkeys():
    for i, y in enumerate(sim_sorted[x]):
        y[2] = 1               # the last element is at index 2
        if i % 5000 == 0:      # save the data temporarily
            transaction.savepoint()

# Sort the list associated with every key in reverse order, using the middle element as the sort key
for keys in sim_sorted.iterkeys():
    sim_sorted[keys].sort(key=lambda x: -x[1])

However, the code used for editing the values is eating up all the memory (it never gets to the sorting). I am not sure how this works, but I have a feeling that there is something terribly wrong with my code and that ZODB is pulling everything into memory, hence the issue. What would be the correct way to achieve the desired effect, i.e. the substitution and sorting of stored elements in ZODB, without running into memory issues? The code is also very slow; any suggestions to speed it up?

[Note: It's not necessary for me to write these changes back to the database]

EDIT: Adding connection.cacheMinimize() after the inner loop improves memory usage a little, but after some time the entire RAM is again consumed, which leaves me puzzled.
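
Concretely, the modified loop now looks roughly like this:

for x in sim_sorted.iterkeys():
    for i, y in enumerate(sim_sorted[x]):
        y[2] = 1
        if i % 5000 == 0:
            transaction.savepoint()
    connection.cacheMinimize()   # drop unmodified objects from the connection cache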

1 Answer


Are you certain it's not the sorting that's killing your memory?

Note that I'd expect each PersistentList to have to fit into memory; it is one persistent record, so it'll be loaded as a whole on access.

I'd modify your code to run like this and see what happens:

for x in sim_sorted.iterkeys():
    for y in sim_sorted[x]:
        y[2] = 1
    sim_sorted[x].sort(key=lambda y: -y[1])
    transaction.savepoint()

Now you process the whole list in one go and sort it; after all, it has already been loaded into memory as a whole. After processing, you tell the ZODB you are done with this stage, and the whole changed list will be flushed to temporary storage. There is little point in flushing it when you are only halfway done.

If this still doesn't fit into memory for you, you'll need to rethink your data structure and split the large lists up into smaller persistent records, so you can work on chunks of the data at a time without loading the whole thing in one go; see the sketch below.
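
For example, something along these lines (a minimal sketch, not tested against your data; CHUNK, append_row and iter_rows are illustrative names I made up, not ZODB API):

from BTrees.IOBTree import IOBTree
from persistent.list import PersistentList

CHUNK = 1000  # rows per persistent record; tune to your memory budget

def append_row(container, key, row):
    # Append a row under `key`, splitting the data into CHUNK-sized records.
    chunks = container.setdefault(key, IOBTree())
    last = chunks.maxKey() if chunks else -1
    if last < 0 or len(chunks[last]) >= CHUNK:
        last += 1
        chunks[last] = PersistentList()
    chunks[last].append(row)

def iter_rows(chunks):
    # Walk all rows for one key; only one chunk needs to be in memory at a time.
    for index in chunks.keys():
        for row in chunks[index]:
            yield row

With this layout you can modify and sort each chunk separately, issuing a savepoint between chunks; the trade-off is that a globally sorted order across chunks then requires a merge step.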

  • Martijn: That's really strange! I am sure it was not the sorting, since I had put a checkpoint in my code and it never reached the sorting section, but somehow the approach you recommended seems to work! Would you recommend using the connection.cacheMinimize() method? – R.Bahl Jul 05 '12 at 14:17
  • Martijn: I am sure that a single list is small enough to fit into memory (after all, I am building it that way). Although the above works to an extent, most of my RAM is still slowly being converted into *inactive memory*, and I am not sure how to clear that up. Any comments? – R.Bahl Jul 05 '12 at 15:45
  • First of all: not sure if `cacheMinimize` will do much; it'll clean up any objects that have not been modified. But since you are only loading objects that you *do* modify I think it'll not make much odds. Secondly: Don't worry about inactive memory; your OS will reclaim that when needed. This is a strategy where processes keep memory even though they don't currently use it, on the basis that they'll soon re-use it. – Martijn Pieters Jul 05 '12 at 15:50
  • Thanks for the suggestions! I have another piece of code, following the one mentioned above, where I just read the data; I'll probably use `cacheMinimize()` there. – R.Bahl Jul 05 '12 at 16:47
  • Martijn: I am observing something very peculiar. If I do it the way you recommend, it works and memory is not cluttered. However, when I just apply the changes and skip the sorting, it doesn't work and memory is exhausted. I am really puzzled as to why this happens; I checked the database to see if it is getting modified, and it is. Any idea why this would happen? – R.Bahl Jul 06 '12 at 07:20
  • (Note: we are stretching SO to its limits again; perhaps a new Q is in order.) The ZODB only writes when you `.commit()`. Try `.abort()` instead, and check `connection.getTransferCounts()`; the latter returns a tuple of `(loaded, stored)` objects on the connection. You can clear the counts with `connection.getTransferCounts(clear=True)`, then later assert that the stored count is still 0 to make sure your code really doesn't write anything. – Martijn Pieters Jul 06 '12 at 07:56
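
A minimal sketch of the check described in the last comment (reusing the `connection` and `sim_sorted` objects from above):

import transaction

connection.getTransferCounts(clear=True)       # reset the (loaded, stored) counters

for x in sim_sorted.iterkeys():
    for y in sim_sorted[x]:
        y[2] = 1                               # modify in memory only

transaction.abort()                            # discard the changes instead of committing

loaded, stored = connection.getTransferCounts()
assert stored == 0, "code unexpectedly stored %d objects" % stored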