
I have a large number of data files; each data set loaded from a file is resampled hundreds of times and processed by several methods. I use numpy for this. What I'm facing is a memory error after the program has run for several hours. Since each data set is processed separately and the results are stored in a .mat file with scipy.savemat, I think the memory used by previous data can be released, so I used del variable_name + gc.collect(), but this does not work. Then I used the multiprocessing module, as suggested in this post and this post, and it still does not work.

Here are my main codes:

import scipy.io as scio
import gc
from multiprocessing import Pool

def dataprocess_session():  # file_lists is assumed to be defined at module level
    i = -1
    for f in file_lists:
        i += 1
        data = scio.loadmat(f)
        ixs = data['rm_ix'] # resample indices
        del data
        gc.collect()
        data = scio.loadmat('xd%d.mat'%i) # this is the actual data; the indices in "ixs" are used to resample sub-data from it

        j = -1
        mvs_ls_org = {} # preallocate result containers as dictionaries, since scipy.savemat expects a dict
        mvs_ls_norm = {}
        mvs_ls_auto = {}
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data,ix)
            mvs_ls_org[key] = process(X)

        scio.savemat('d%d_ls_org.mat'%i,mvs_ls_org)
        del mvs_ls_org
        gc.collect()

        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data,ix)
            X2 = scale(X.copy(), 'norm')
            mvs_ls_norm[key] = process(X2)

        scio.savemat('d%d_ls_norm.mat'%i,mvs_ls_norm)
        del mvs_ls_norm
        gc.collect()

        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data,ix)
            X2 = scale(X.copy(), 'auto')
            mvs_ls_auto[key] = process(X2)

        scio.savemat('d%d_ls_auto.mat'%i,mvs_ls_auto)
        del mvs_ls_auto
        gc.collect()

        # use another method to process data
        j = -1
        mvs_fcm_org = {} # also preallocate result containers
        mvs_fcm_norm = {}
        mvs_fcm_auto = {}
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data['X'].copy(), ix)
            dp, _ = process_2(X.copy())
            mvs_fcm_org[key] = dp

        scio.savemat('d%d_fcm_org.mat'%i,mvs_fcm_org)
        del mvs_fcm_org
        gc.collect()

        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data['X'].copy(), ix)
            X2 = scale(X.copy(), 'norm')
            dp, _ = process_2(X2.copy())
            mvs_fcm_norm[key] = dp

        scio.savemat('d%d_fcm_norm.mat'%i,mvs_fcm_norm)
        del mvs_fcm_norm
        gc.collect()

        j = -1
        for ix in ixs:
            j += 1
            key = 'name%d'%j
            X = resample_from_data(data['X'].copy(), ix)
            X2 = scale(X.copy(), 'auto')
            dp, _ = process_2(X2.copy())
            mvs_fcm_auto[key] = dp

        scio.savemat('d%d_fcm_auto.mat'%i,mvs_fcm_auto)
        del mvs_fcm_auto
        gc.collect()

This is the way I did it initially. I split file_lists into 7 parts and ran 7 Python screens, as my computer has 8 CPU cores. There is no problem in MATLAB when I work this way. I do not combine the iterations over ixs for the different processing methods, because the memory error can occur if I do, so I ran resample_from_data and saved the results separately for each pass. As the memory error persisted, I used the Pool class:

pool = Pool(processes=7)
pool.map(dataprocess_session_2, file_lists)

which runs the iteration over file_lists in parallel, with the file names in file_lists as inputs.
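
The complete invocation looks roughly like this (sketch only; here dataprocess_session_2 is assumed to be the per-file counterpart of dataprocess_session, taking one file name, saving its own .mat results and returning nothing):

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(processes=7)                     # leave one of the 8 cores free
    pool.map(dataprocess_session_2, file_lists)  # one task per data file
    pool.close()                                 # no further tasks will be submitted
    pool.join()                                  # wait for all workers to finish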

All code is run on openSUSE with Python 2.7.5, an 8-core CPU and 32 GB of RAM. I used top to monitor memory usage. None of the matrices are particularly large, and everything is fine if I run the whole pipeline on any single loaded data set. But after several iterations over file_lists, the free memory drops dramatically. I'm sure this is not caused by the data itself, since nowhere near that much memory should be in use even while the largest data matrix is being processed. So I suspect that the approaches above, which I tried in order to release the memory used for processing previous data and for storing the results, did not really release it.

Any suggestions?

  • Why was I downvoted? – Elkan Jan 16 '17 at 10:40
  • I really need to read up on this properly myself, but I don't think you can force the hand of the `gc` in this way. Python uses reference counting to decide whether objects should be deleted, so `gc.collect()` probably isn't removing those objects immediately. You might have more hope if you broke each of those loops down into its own function that does not return the data (a sketch of this follows the comments), but again, you should only take this as a pointer for further reading on Python's gc, as I could be well off with that suggestion. Hopefully someone more knowledgeable can clarify. – roganjosh Jan 16 '17 at 10:53
  • Update on your suggestion: I'm trying it now, and although no error occurs, after running my datasets for the past two days only a few KB of memory (compared with the 32 GB RAM of my PC) is left free. As the data being processed is relatively small, I'm afraid that memory can still be exhausted this way. – Elkan Jan 18 '17 at 13:58
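
A sketch of what roganjosh's suggestion could look like for one family of passes, reusing the names from the code above (each pass builds and saves its dictionary inside a function and returns nothing, so every intermediate reference goes out of scope when the call returns; the run_ls_pass helper is hypothetical):

def run_ls_pass(data, ixs, scaling, out_name):
    # everything created here (X, the results dict) is dropped when the function returns
    results = {}
    for j, ix in enumerate(ixs):
        X = resample_from_data(data, ix)
        if scaling is not None:
            X = scale(X.copy(), scaling)
        results['name%d' % j] = process(X)
    scio.savemat(out_name, results)  # save inside; return nothing to the caller

# inside the loop over file_lists:
# run_ls_pass(data, ixs, None,   'd%d_ls_org.mat' % i)
# run_ls_pass(data, ixs, 'norm', 'd%d_ls_norm.mat' % i)
# run_ls_pass(data, ixs, 'auto', 'd%d_ls_auto.mat' % i)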

2 Answers


All the variables you del explicitly are released automatically as soon as the loop ends, so I don't think they are your problem. I think it's more likely that your machine simply can't handle 7 threads with (in the worst case) 7 simultaneously executing data = scio.loadmat(f) calls. You could try marking that call as a critical section with locks.
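
A rough sketch of that idea, building on the pool.map call from the question (a multiprocessing.Lock is shared with the workers through the Pool initializer, which is one common pattern; only the load is serialized, and the rest of the per-file processing stays outside the lock):

from multiprocessing import Pool, Lock
import scipy.io as scio

lock = None

def init_worker(shared_lock):
    global lock              # make the shared lock visible inside each worker
    lock = shared_lock

def dataprocess_session_2(f):
    with lock:               # only one worker is inside loadmat at a time
        data = scio.loadmat(f)
    # ... resample, process and scio.savemat as before, outside the lock ...

# file_lists as in the question
pool = Pool(processes=7, initializer=init_worker, initargs=(Lock(),))
pool.map(dataprocess_session_2, file_lists)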

  • Thanks. Can you explain this a bit more? There is no problem with any 7 data sets being loaded into memory simultaneously, so what does "the machine can't handle 7 threads with 7 simultaneously executed data = scio.loadmat(f)" mean? – Elkan Jan 16 '17 at 10:44
  • It's not only the 7 data sets. You need to think about the worst case, in which all threads are using their maximum amount of memory. I can't tell you which objects get the biggest; you might be able to profile them (best in single-thread mode). – Max Uppenkamp Jan 16 '17 at 10:54
  • I see. I used MATLAB to do the same work on the same datasets, only with different processing methods (specified by the `process` function in my Python code); even more was loaded into memory in MATLAB, yet no error appeared after running for days. My matrices are not that large, the returned `X` and `X2` have the same size as the original matrix, and there is still plenty of free memory even after iterating through `ixs`, as monitored with `top` on Linux. I have noticed that Python keeps `free lists` after `del`, but I did not find any effective solution, so I wonder whether I have missed something. – Elkan Jan 16 '17 at 11:16

This may be of some help: gc.collect()

    "I think the memory used by previous data can be released, so I used del variable_name+gc.collect(), but this does not work." – roganjosh Jan 16 '17 at 10:31