
I am currently trying to store some data in .h5 files. I quickly realised that I might have to store my data in parts, as it is not possible to process all of it in RAM at once. I started out using numpy.array to compress the memory usage, but that resulted in days spent on formatting data.

So I went back to using lists, but made the program monitor the memory usage: when it rises above a specified value, a part is stored in numpy format, so that another process can load it and make use of it. The problem with doing this is that what I thought would release my memory isn't releasing it. For some reason the memory usage stays the same even though I reset the variable and del the variable. Why isn't the memory being released here?

import numpy as np
import resource
import gc
import h5py

total_frames = 15
total_frames_with_deltas = total_frames*3
dim = 40
window_height = 5


def store_file(file_name,data):
    # Concatenate the accumulated parts into one array and write it to disk.
    with h5py.File(file_name,'w') as f:
        f["train_input"] = np.concatenate(data,axis=1)

def load_data_overlap(saved):
    #os.chdir(numpy_train)
    print "Inside function!..."
    if saved == False:
        train_files = np.random.randint(255,size=(1,40,690,4))
        train_input_data_interweawed_normalized = []
        print "Storing train pic to numpy"
        part = 0
        for i in xrange(100000):
            print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            if resource.getrusage(resource.RUSAGE_SELF).ru_maxrss > 2298842112/10:  # threshold: roughly 230 MB of peak RSS
                print "Max ram storing part: " + str(part) + " At entry: " + str(i)
                print "Storing Train input"
                file_name = 'train_input_'+'part_'+str(part)+'_'+str(dim)+'_'+str(total_frames_with_deltas)+'_window_height_'+str(window_height)+'.h5'
                store_file(file_name,train_input_data_interweawed_normalized)
                part = part + 1
                # Drop the reference, force a collection, then start a fresh list.
                del train_input_data_interweawed_normalized
                gc.collect()
                train_input_data_interweawed_normalized = []
                raw_input("something")
            for plot in train_files:
                overlaps_reshaped = np.random.randint(10,size=(45,200,5,3))
                for ind_plot in overlaps_reshaped.reshape(overlaps_reshaped.shape[1],overlaps_reshaped.shape[0],overlaps_reshaped.shape[2],overlaps_reshaped.shape[3]): 
                    ind_plot_reshaped = ind_plot.reshape(ind_plot.shape[0],1,ind_plot.shape[1],ind_plot.shape[2])
                    train_input_data_interweawed_normalized.append(ind_plot_reshaped)
    print len(train_input_data_interweawed_normalized)

    return train_input_data_interweawed_normalized
#------------------------------------------------------------------------------------------------------------------------------------------------------------

saved = False
train_input = load_data_overlap(saved)

output:

.....
223662080
224772096
225882112
226996224
228106240
229216256
230326272
Max ram storing part: 0 At entry: 135
Storing Train input
something
377118720
Max ram storing part: 1 At entry: 136
Storing Train input
something
377118720
Max ram storing part: 2 At entry: 137
Storing Train input
something
J.Down
  • In my opinion, this is a bit bulky for a [MCVE](https://stackoverflow.com/help/mcve). Your question is concerned with freeing memory inside a `for` loop by saving a long list to file, then deleting the list. Try cutting everything inside the `for plot in train_files:` loop and just add a list of [random variables](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.random.rand.html) to your list for each point in the loop. Then provide some of the output. – christopherlovell Apr 11 '17 at 17:08
  • added new version and output – J.Down Apr 11 '17 at 17:25
  • you've removed the `del` statements. Also, rather than saving the whole array to a new variable (and taking up twice as much memory in the process), why not save it and then just delete it: `h5f.create_dataset('train_input', data=np.concatenate(train_input_data_interweawed_normalized,axis=1))` (a sketch of this pattern follows the comment thread) – christopherlovell Apr 11 '17 at 17:34
  • You mean something like this? : https://pastebin.com/25MtFeii – J.Down Apr 11 '17 at 17:40
  • yes, but without all this: `train_input_data_interweawed_normalized = None train_input_data_interweawed_normalized = [] del h5f`. What is the updated output? – christopherlovell Apr 11 '17 at 17:41
  • well.. I need `train_input_data_interweawed_normalized = []` as it is being used in the for loop.. but it still causes the same problem. – J.Down Apr 11 '17 at 17:45
  • OK, now we're getting somewhere. It looks like the memory is still being doubled, so it's probably got something to do with how `h5py` is storing the data in memory before writing. Depending on your platform, you may want to experiment with different [file drivers](http://docs.h5py.org/en/latest/high/file.html) when initialising your `h5py.File` – christopherlovell Apr 11 '17 at 17:53
  • sooo?.. `h5f = h5py.File('train_input_'+'part_'+str(part)+'_'+str(dim)+'_'+str(total_frames_with_deltas)+'_window_height_'+str(window_height)+'.h5',driver=H5FD_SEC2 ,'w')` – J.Down Apr 11 '17 at 17:58
  • Well.. wait.. using `none` is what is recommended... so what else are you suggesting? – J.Down Apr 11 '17 at 18:01
  • Using `none` is recommended, but may be causing you problems. It depends on your platform. You could also try using `with` to open the file, as explained [here](http://stackoverflow.com/questions/29863342/close-an-open-h5py-data-file), which explicitly closes the file. Or handle each file with a function call, as described [here](http://stackoverflow.com/questions/7675838/python-hdf5-h5py-issues-opening-multiple-files?rq=1). – christopherlovell Apr 11 '17 at 18:10
  • `def store_file(file_name,data): with h5py.File(file_name,'w') as f: f["train_input"] = np.concatenate(data,axis=1)` still same problem.. – J.Down Apr 11 '17 at 18:21
  • It seems that the memory increase occurs when I np.concatenate the data.. I don't think it is due to `h5py`. – J.Down Apr 11 '17 at 18:22
  • which versions of hdf5 and h5py are you using? – christopherlovell Apr 11 '17 at 18:23
  • `np.concatenate` does use double the memory... so it makes sense that it is the killer, but it is not being removed. Versions are... – J.Down Apr 11 '17 at 18:26
  • `>>> h5py.__version__ '2.7.0'` – J.Down Apr 11 '17 at 18:27
  • i don't use hdf5 – J.Down Apr 11 '17 at 18:28
  • "i don't use hdf5" - if you're using OSX, I think you do. To test if it's the concatenation, lower your memory threshold before writing to disk. Set it to something really low and see if it completes a write to disk correctly. – christopherlovell Apr 11 '17 at 18:32
  • I am not able to put it to a low value due to the `range` taking a lot of space... – J.Down Apr 11 '17 at 18:36
  • The memory is the same after one save.. the memory never decrements from its last value, meaning something isn't getting properly deleted. – J.Down Apr 11 '17 at 18:43
  • Warning: homebrew/science/hdf5-1.8.16_1 already installed – J.Down Apr 11 '17 at 18:55
  • This thread has already gone on far too long in the comments. moved to [chat](http://chat.stackexchange.com/rooms/info/56955/releasing-memory-usage-by-variable-in-python?tab=general) – christopherlovell Apr 11 '17 at 19:06
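
Following the suggestions above, a minimal sketch of the write pattern christopherlovell proposes: hand the concatenated array straight to `create_dataset` so no second named reference to the temporary copy survives the write, and open the file with a context manager so it is closed explicitly. The `driver` keyword (here `'sec2'`, the POSIX driver, one of h5py's documented options) only shows the corrected syntax for the file-driver experiment; this is untested against the asker's data.

def store_file(file_name, data):
    # Build the big array and pass it straight to h5py; once this call
    # returns, the temporary from np.concatenate has no surviving
    # reference in our code and can be reclaimed.
    with h5py.File(file_name, 'w') as f:  # optionally: h5py.File(file_name, 'w', driver='sec2')
        f.create_dataset('train_input', data=np.concatenate(data, axis=1))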

1 Answer


You need to explicitly force garbage collection; see here:

According to the Python official documentation, you can force the garbage collector to release unreferenced memory with `gc.collect()`.
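
In the loop from the question, that pattern would look something like this sketch (names taken from the question's code):

store_file(file_name, train_input_data_interweawed_normalized)
del train_input_data_interweawed_normalized   # drop the only reference to the list
gc.collect()                                  # force an immediate collection
train_input_data_interweawed_normalized = []  # start accumulating a fresh part

One caveat when checking the effect: `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss` reports the peak resident set size, a high-water mark, so the value printed in the question can never decrease even if the collection does return memory; a live-RSS probe would be needed to confirm it.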

christopherlovell
  • OK, did you do it inside the loop after each `del`? Can you provide (in the question text) the memory usage as the program runs? Or, if you're feeling generous, an [MCVE](https://stackoverflow.com/help/mcve)? – christopherlovell Apr 11 '17 at 16:49
  • The example given above is a minimal complete working example: randomly generated data, the actual processing I am doing, and the storing. You should be able to run the code as it is. I used the garbage collector just before raw_input("Something!") and then saw that memory usage increased from the last value.. I will add the output of the code. – J.Down Apr 11 '17 at 16:56
  • see comment on question – christopherlovell Apr 11 '17 at 17:38