
I have about 200 numpy arrays saved as files, and I would like to combine them into one big array. Currently I do that with a loop, concatenating each one individually. But I heard this is memory inefficient, because concatenating also makes a copy.

From "Concatenate Numpy arrays without copying":

> If you know beforehand how many arrays you need, you can instead start with one big array that you allocate beforehand, and have each of the small arrays be a view into the big array (e.g. obtained by slicing).
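
A minimal sketch of that preallocate-and-view approach, assuming 200 equally sized 2-D arrays (the shapes and dtype below are made up):

```python
import numpy as np

n_arrays, rows_each, cols = 200, 1000, 8  # hypothetical sizes

# Allocate the big array once, up front.
big = np.empty((n_arrays * rows_each, cols), dtype=np.float64)

# Each slice is a view into `big`, so filling a view writes directly
# into the big array; no extra copy is made.
views = [big[i * rows_each:(i + 1) * rows_each] for i in range(n_arrays)]
for v in views:
    v[:] = 1.0  # stand-in for whatever produces each small array
```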

So I am wondering if I should instead load each numpy array individually, add up the row counts of all the arrays, create a new numpy array with that many rows, copy each smaller array into it, and delete each smaller array as I go. Or is there some aspect of this I am not taking into account?
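
That two-pass plan, sketched under the assumption that the arrays are stored as `.npy` files sharing a column count and dtype (the file names here are hypothetical):

```python
import numpy as np

files = [f"arr_{i}.npy" for i in range(200)]  # hypothetical file names

# Pass 1: total the row counts without loading the full arrays.
# mmap_mode="r" memory-maps each file, so reading .shape only
# touches the header.
total_rows = 0
for f in files:
    a = np.load(f, mmap_mode="r")
    total_rows += a.shape[0]
    n_cols, dtype = a.shape[1], a.dtype

# Allocate the combined array once.
out = np.empty((total_rows, n_cols), dtype=dtype)

# Pass 2: copy each small array into its slot; rebinding `a` each
# iteration frees the previous one, so peak memory stays near the
# size of `out` plus one small array.
row = 0
for f in files:
    a = np.load(f)
    out[row:row + a.shape[0]] = a
    row += a.shape[0]
```

The simpler alternative of loading everything into a Python list and concatenating once, e.g. `np.concatenate([np.load(f) for f in files])`, avoids the second pass over the files, but at its peak it holds all the chunks plus the result in memory.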

SantoshGupta7
  • That depends on specifics, e.g. how close you are to your memory limit. What you describe is the optimal approach in terms of peak memory usage (ignoring additional on-the-fly processing of the inner arrays), but it is a 2-pass process, so you are paying in time and IO. Appending all of those arrays to a Python list before creating a numpy array in one go from that list is an alternative that consumes more memory but needs no second pass, no resize-copies, and gives cleaner code. – sascha Aug 28 '19 at 20:56
  • It is not memory inefficient. It is computationally inefficient, because ensuring that the result is contiguous requires allocating and **copying** the arrays for each concatenation. If you are worried about computational efficiency, you can proceed as you described. If you want to improve your memory efficiency you should use a two-pass approach: (1) determine the size of the output by looping through your arrays (unloading each array at the end of its iteration); (2) populate your output by looping through your arrays again (again unloading each array as you go). – norok2 Aug 28 '19 at 20:57
  • @sascha I believe that appending to a Python list just hides the second pass in the `list` to NumPy array conversion. – norok2 Aug 28 '19 at 21:02
  • (continued) Note that in the first pass you could probably use [`np.memmap()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html), since you are only interested in the shape. – norok2 Aug 28 '19 at 21:05
  • Fill us in on a couple of details. What file format are you loading, and which function are you using? What's the shape of each array? Same for all? Same number of columns but different rows? What shape should the end result have? And how about the `dtype`? – hpaulj Aug 29 '19 at 00:11
  • I am loading csv files using pandas `read_csv`; each row contains a column with an int, and another column with a list of ints. These vary, and can be from 2 to 200. For dtype, the first column is an int, but I have been playing around with using lists and numpy arrays for the 2nd column. The former saves to a smaller file, but the latter takes up less RAM. – SantoshGupta7 Aug 29 '19 at 00:14
  • So what's the combined array supposed to look like? I think all the comments assumed you were loading 200 (n,m) shaped numeric arrays, and wanted a (200,n,m) shaped result. – hpaulj Aug 29 '19 at 01:09
  • There will be about 14 million rows. The first column of each row is a single int. The second column is a list or numpy array of ints, which range from 2 to 200. – SantoshGupta7 Aug 29 '19 at 01:11
  • So you'll have a (14M, 2) shaped object dtype array. Each element of the array points to one of those 14M small lists/arrays. Most of the memory use will be in those small arrays, which will be scattered throughout memory. All your joining does is combine the rows of pointers. Most comments about efficient memory use and copies don't apply (see the sketch after these comments). – hpaulj Aug 29 '19 at 01:48
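
A quick way to see hpaulj's point, with made-up data standing in for the real rows:

```python
import numpy as np

# Two (n, 2) chunks whose second column holds variable-length int
# arrays; the ragged inner arrays force an object dtype.
chunk1 = np.array([(1, np.array([1, 2])),
                   (2, np.array([3, 4, 5]))], dtype=object)
chunk2 = np.array([(3, np.array([6, 7, 8, 9]))], dtype=object)

big = np.concatenate([chunk1, chunk2], axis=0)
print(big.shape, big.dtype)       # (3, 2) object

# Concatenation copied only the pointers; the small arrays themselves
# are shared between `big` and the original chunks.
print(big[0, 1] is chunk1[0, 1])  # True
```

Concatenating 200 such chunks therefore only copies ~14M pointers; the bulk of the memory lives in the small per-row arrays either way.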

0 Answers