I have to iteratively read data files and store the data into (numpy) arrays. I chose to store the data in a dictionary of "data fields": {'field1': array1, 'field2': array2, ...}.
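For illustration, this is the kind of layout I have in mind (the field names and values here are made up):
import numpy as np

# Made-up example of the target layout: one 1-D numpy array per field
dataDict = {'field1': np.array([1.0, 2.0, 3.0]),
            'field2': np.array([4.0, 5.0, 6.0])}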
Case 1 (lists):
Using lists (or collections.deque()) for appending the new data arrays, the code is efficient. But when I concatenate the arrays stored in the lists, the memory grows and I did not manage to free it again (a cleanup variant I have in mind is sketched after the Case 1 numbers below). Example:
import numpy as np

filename = 'test'  # data file with a matrix of shape (98, 56)
nFields = 56

# Initialize the data dictionary: each entry holds a list of arrays
dataDict = {}
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = []

# Read a data file N times (it stands in for reading N files;
# each file contains 56 fields of arbitrary length in the example)
# and append the data fields to the lists in the data dictionary
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i, field in enumerate(field_names):
        dataDict[field].append(xy[:, i])

# Concatenate the list members (arrays) into one numpy array per field
for key, value in dataDict.iteritems():
    dataDict[key] = np.concatenate(value, axis=0)
Computing time: 63.4 s
Memory usage (top): 13862 gime_se 20 0 1042m 934m 4148 S 0 5.8 1:00.44 python
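For reference, the kind of cleanup I would expect to release the lists is sketched below: a drop-in variant of the final concatenation loop of Case 1 that drops each per-field list right after it has been merged and then forces a garbage collection. This is only an illustration of the idea (same variables as above), not something I am claiming works:
import gc

# Variant of the final loop in Case 1: concatenate field by field,
# drop the list of per-file arrays as soon as it has been merged,
# then force a garbage collection at the end
for key in dataDict.keys():
    value = dataDict[key]
    dataDict[key] = np.concatenate(value, axis=0)
    del value
gc.collect()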
Case 2 (numpy arrays):
Concatenating the numpy arrays directly each time they are read is inefficient, but memory remains under control. Example:
import numpy as np

filename = 'test'  # data file with a matrix of shape (98, 56)
nFields = 56

# Initialize the data dictionary: each entry holds a (growing) numpy array
dataDict = {}
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = np.array([])

# Read a data file N times (it stands in for reading N files)
# and concatenate the data fields onto the arrays in the data dictionary
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i, field in enumerate(field_names):
        dataDict[field] = np.concatenate((dataDict[field], xy[:, i]))
Computing time: 1377.8 s
Memory usage (top): 14850 gime_se 20 0 650m 542m 4144 S 0 3.4 22:31.21 python
Question(s):
Is there any way of getting the performance of Case 1 while keeping the memory under control as in Case 2?
It seems that in Case 1 the memory grows when the list members are concatenated (np.concatenate(value, axis=0)). Any better ideas for doing this?
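One direction I am wondering about (sketched below, untested, with my own naming for nRows) is to preallocate one array per field and fill it in place, which would avoid both the per-iteration concatenation of Case 2 and the big final concatenation of Case 1. The sketch assumes, as in the toy example above, that every file has the same number of rows:
import numpy as np

filename = 'test'
nFields = 56
N = 10000

# One extra read (or prior knowledge) gives the number of rows per file
nRows = np.loadtxt(filename).shape[0]

# Preallocate one flat array per field and fill slices in place
field_names = [repr(i) for i in xrange(nFields)]
dataDict = dict((field, np.empty(N * nRows)) for field in field_names)
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i, field in enumerate(field_names):
        dataDict[field][j * nRows:(j + 1) * nRows] = xy[:, i]
Would something along these lines keep the memory footprint close to the final size of the data?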