I have a large number of data files, and each dataset loaded from a data file is resampled hundreds of times and processed by several methods. I use numpy for this. What I'm facing is a memory error after the program has run for several hours. Since each dataset is processed separately and its results are stored in a .mat file using scipy.savemat, I figured the memory used by the previous dataset could be released, so I tried del variable_name followed by gc.collect(), but this did not work. I then used the multiprocessing module, as suggested in this post and this post, but it still does not work.

Here is my main code:
import scipy.io as scio
import gc
from multiprocessing import Pool

def dataprocess_session():
    # file_lists is the global list of index-file names
    for i, f in enumerate(file_lists):
        data = scio.loadmat(f)
        ixs = data['rm_ix']  # resample indices
        del data
        gc.collect()
        data = scio.loadmat('xd%d.mat' % i)  # the data; indices in "ixs" are used to resample subdata from it

        # preallocate result containers as dictionaries, as required by scipy.savemat
        mvs_ls_org = {}
        mvs_ls_norm = {}
        mvs_ls_auto = {}

        for j, ix in enumerate(ixs):
            key = 'name%d' % j
            X = resample_from_data(data, ix)
            mvs_ls_org[key] = process(X)
        scio.savemat('d%d_ls_org.mat' % i, mvs_ls_org)
        del mvs_ls_org
        gc.collect()

        for j, ix in enumerate(ixs):
            key = 'name%d' % j
            X = resample_from_data(data, ix)
            X2 = scale(X.copy(), 'norm')
            mvs_ls_norm[key] = process(X2)
        scio.savemat('d%d_ls_norm.mat' % i, mvs_ls_norm)
        del mvs_ls_norm
        gc.collect()

        for j, ix in enumerate(ixs):
            key = 'name%d' % j
            X = resample_from_data(data, ix)
            X2 = scale(X.copy(), 'auto')
            mvs_ls_auto[key] = process(X2)
        scio.savemat('d%d_ls_auto.mat' % i, mvs_ls_auto)
        del mvs_ls_auto
        gc.collect()

        # use another method to process the data
        mvs_fcm_org = {}  # also preallocate result containers
        mvs_fcm_norm = {}
        mvs_fcm_auto = {}

        for j, ix in enumerate(ixs):
            key = 'name%d' % j
            X = resample_from_data(data['X'].copy(), ix)
            dp, _ = process_2(X.copy())
            mvs_fcm_org[key] = dp
        scio.savemat('d%d_fcm_org.mat' % i, mvs_fcm_org)
        del mvs_fcm_org
        gc.collect()

        for j, ix in enumerate(ixs):
            key = 'name%d' % j
            X = resample_from_data(data['X'].copy(), ix)
            X2 = scale(X.copy(), 'norm')
            dp, _ = process_2(X2.copy())
            mvs_fcm_norm[key] = dp
        scio.savemat('d%d_fcm_norm.mat' % i, mvs_fcm_norm)
        del mvs_fcm_norm
        gc.collect()

        for j, ix in enumerate(ixs):
            key = 'name%d' % j
            X = resample_from_data(data['X'].copy(), ix)
            X2 = scale(X.copy(), 'auto')
            dp, _ = process_2(X2.copy())
            mvs_fcm_auto[key] = dp
        scio.savemat('d%d_fcm_auto.mat' % i, mvs_fcm_auto)
        del mvs_fcm_auto
        gc.collect()
This is how I did it initially. I split file_lists into 7 parts and ran 7 python screens, since my computer has 8 CPU cores. There is no problem in MATLAB when I do it this way. I do not combine the iterations over ixs for the different processing methods into one loop, because that can also trigger the memory error, so I run resample_from_data and save the results separately for each method. As the memory error persisted, I tried the Pool class:
pool = Pool(processes=7)
pool.map(dataprocess_session_2, file_lists)
which parallelizes the iteration over file_lists, passing the file names in file_lists as inputs (so dataprocess_session_2 processes one file per call).
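
For reference, a minimal sketch of that Pool setup, with dataprocess_session_2 reduced to a stub; the file names and the worker body are placeholders, the real processing steps are those in dataprocess_session above:

import scipy.io as scio
from multiprocessing import Pool

def dataprocess_session_2(f):
    # Per-file worker: same processing steps as dataprocess_session above, for a single file.
    data = scio.loadmat(f)            # load the resample indices for this file
    ixs = data['rm_ix']
    # ... resample, process, and scio.savemat the results for this file only ...
    del data
    return f                          # return something small; large results never travel back to the parent

if __name__ == '__main__':
    # placeholder file names; in my code this is the real list of index files
    file_lists = ['f%d.mat' % k for k in range(7)]
    pool = Pool(processes=7)                      # 7 workers on an 8-core machine
    pool.map(dataprocess_session_2, file_lists)   # one call per file name
    pool.close()
    pool.join()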
All code runs on openSuSE with python 2.7.5, an 8-core CPU and 32 GB RAM. I use top to monitor memory usage. None of the matrices is particularly large, and everything is fine if I run the full code on any single loaded dataset. But after several iterations over file_lists, the free memory drops dramatically. I'm sure this is not caused by the data itself, since nowhere near that much memory should be in use even while the largest data matrix is being processed. So I suspect that the approaches above, meant to release the memory used for processing the previous dataset and for storing its results, do not actually release the memory.
Any suggestions?