I have to iteratively read data files and store the data into (numpy) arrays. I chose to store the data in a dictionary of "data fields": {'field1': array1, 'field2': array2, ...}.
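For illustration, this is the kind of layout I have in mind (the field names and values here are made up):
import numpy as np

# Made-up example of the target layout: one 1-D numpy array per field
dataDict = {'field1': np.array([1.0, 2.0, 3.0]),
            'field2': np.array([4.0, 5.0, 6.0])}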
Case 1 (lists):
Using lists (or collections.deque()) for appending the new data arrays, the code is efficient. But when I concatenate the arrays stored in the lists, the memory grows and I did not manage to free it again (a cleanup variant I have in mind is sketched after the Case 1 numbers below). Example:
import numpy as np

filename = 'test'  # data file with a matrix of shape (98, 56)
nFields = 56

# Initialize the data dictionary: each entry holds a list of arrays
dataDict = {}
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = []

# Read a data file N times (it stands in for reading N files;
# each file contains 56 fields of arbitrary length in the example)
# and append the data fields to the lists in the data dictionary
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i, field in enumerate(field_names):
        dataDict[field].append(xy[:, i])

# Concatenate the list members (arrays) into one numpy array per field
for key, value in dataDict.iteritems():
    dataDict[key] = np.concatenate(value, axis=0)
Computing time: 63.4 s
Memory usage (top): 13862 gime_se 20 0 1042m 934m 4148 S 0 5.8 1:00.44 python
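For reference, the kind of cleanup I would expect to release the lists is sketched below: a drop-in variant of the final concatenation loop of Case 1 that drops each per-field list right after it has been merged and then forces a garbage collection. This is only an illustration of the idea (same variables as above), not something I am claiming works:
import gc

# Variant of the final loop in Case 1: concatenate field by field,
# drop the list of per-file arrays as soon as it has been merged,
# then force a garbage collection at the end
for key in dataDict.keys():
    value = dataDict[key]
    dataDict[key] = np.concatenate(value, axis=0)
    del value
gc.collect()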
Case 2 (numpy arrays):
Concatenating the numpy arrays directly each time they are read is inefficient, but memory remains under control. Example:
import numpy as np

filename = 'test'  # data file with a matrix of shape (98, 56)
nFields = 56

# Initialize the data dictionary: each entry holds a (growing) numpy array
dataDict = {}
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = np.array([])

# Read a data file N times (it stands in for reading N files)
# and concatenate the data fields onto the arrays in the data dictionary
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i, field in enumerate(field_names):
        dataDict[field] = np.concatenate((dataDict[field], xy[:, i]))
Computing time: 1377.8 s
Memory usage (top): 14850 gime_se 20 0 650m 542m 4144 S 0 3.4 22:31.21 python
Question(s):
Is there any way of getting the performance of Case 1 while keeping the memory under control as in Case 2?
It seems that in Case 1 the memory grows when the list members are concatenated (np.concatenate(value, axis=0)). Any better ideas for doing this?
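One direction I am wondering about (sketched below, untested, with my own naming for nRows) is to preallocate one array per field and fill it in place, which would avoid both the per-iteration concatenation of Case 2 and the big final concatenation of Case 1. The sketch assumes, as in the toy example above, that every file has the same number of rows:
import numpy as np

filename = 'test'
nFields = 56
N = 10000

# One extra read (or prior knowledge) gives the number of rows per file
nRows = np.loadtxt(filename).shape[0]

# Preallocate one flat array per field and fill slices in place
field_names = [repr(i) for i in xrange(nFields)]
dataDict = dict((field, np.empty(N * nRows)) for field in field_names)
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i, field in enumerate(field_names):
        dataDict[field][j * nRows:(j + 1) * nRows] = xy[:, i]
Would something along these lines keep the memory footprint close to the final size of the data?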