
I have many files, and each file is read as a matrix of shape (n, 1000), where n may be different from file to file.

I'd like to concatenate all of them into a single big Numpy array. I currently do this:

import glob
import numpy as np

dataset = np.zeros((0, 1000))            # start with an empty (0, 1000) array
for f in glob.glob('*.png'):
    x = read_as_numpyarray(f)    # custom function; x is a matrix of shape (n, 1000)
    dataset = np.vstack((dataset, x))    # copies the whole dataset on every iteration

but this is inefficient, since every iteration rebuilds dataset by copying the whole existing array together with the next file that is read.

How can I do this in a better way with NumPy, so that the whole dataset is not copied in memory many times?

NB: the final big Numpy array might take 10 GB.

  • Use just one `np.vstack(big_list_arrays)` – hpaulj Jun 24 '21 at 15:51
  • Does this answer your question? [Efficient way for appending numpy array](https://stackoverflow.com/questions/24439137/efficient-way-for-appending-numpy-array) – Chris Tang Jun 24 '21 at 15:54
  • 4
    Congratulations for recognizing that the repeated `vstack` copies the data many times. This topic has come up many times, and the recommendation is to build the list of arrays first. List append is relatively fast and avoids all those copies. Do just one `vstack` at the end. – hpaulj Jun 24 '21 at 16:12
  • @ChrisTang, there's got to be a newer and better duplicate (not that anything has changed in this regard since 2014). – hpaulj Jun 24 '21 at 16:20
  • Does this answer your question? [Fastest way to grow a numpy numeric array](https://stackoverflow.com/questions/7133885/fastest-way-to-grow-a-numpy-numeric-array) – user202729 Oct 25 '21 at 00:57

3 Answers


Use a native list of numpy arrays, then np.concatenate.

A native Python list grows its capacity geometrically (by roughly 1.125× in CPython) when it runs out of space, so only a few reallocations occur; moreover, it only holds pointers to the scattered (non-contiguous in memory) NumPy arrays that hold the actual data.

Calling concatenate only once will solve your problem.

Pseudocode

import glob
import numpy as np

dataset = []
for f in glob.glob('*.png'):
    x = read_as_numpyarray(f)    # custom function; x is a matrix of shape (n, 1000)
    dataset.append(x)            # cheap: only a pointer is appended

dataset_np = np.concatenate(dataset)   # single copy into one contiguous array

Notice vstack internally uses concatenate.


Edit to address the edited question:

Let's say the total size of data is 20 GB. When concatenating, the system will have to still keep 20 GB (for each individual array) and also allocate 20 GB for the new concatenated array, thus requiring 40 GB of RAM (double of the dataset). How to do this without requiring the double of RAM? (Example: is there a solution if we only have 32 GB of RAM?)

I would attack this problem by doing the same as proposed in this answer, but in phases: `dataset_np1 = np.concatenate(half1_of_data)` followed by `dataset_np2 = np.concatenate(half2_of_data)` only needs about 150% of the dataset in RAM at any time (not 200%). This can be extended recursively at the expense of speed, down to the limit at which it becomes the proposition in the question. I can only assume the likes of dask handle this better, but I haven't tested it myself.

Just to clarify: after you have dataset_np1 you no longer need the list of all the small sharded arrays, and can free it. Only then do you start loading the other half. Thus, you only ever need to hold an extra 50% of the data in memory.

Pseudocode:


import glob
import os
import numpy as np

def load_half(buffer: np.ndarray, shard_path: str, shard_ind: int):
    half_dataset = []
    for f in glob.glob(f'{shard_path}/*.png'):
        x = read_as_numpyarray(f)    # custom function; x is a matrix of shape (n, 1000)
        half_dataset.append(x)

    half_dataset_np = np.concatenate(half_dataset)  # see comment *
    # write this half into its slice of the preallocated buffer
    # (assumes each shard holds exactly half of the rows)
    half = buffer.shape[0] // 2
    buffer[half * shard_ind : half * (shard_ind + 1), ...] = half_dataset_np

half1_path = r"half1"  # preprocess the shards to be found by glob or otherwise
half2_path = r"half2"
assert os.path.isdir(half1_path)
assert os.path.isdir(half2_path)

buffer = np.zeros(size_shape)  # size_shape = (total_rows, 1000), known or estimated up front
load_half(buffer, half1_path, 0)  # only 50% of the data is temporarily loaded, then freed [can be done manually if needed]
load_half(buffer, half2_path, 1)  # only 50% of the data is temporarily loaded, then freed

One could (easily, or not so easily) generalize this to quarters, eighths, or recursively any required fraction to reduce memory costs at the expense of speed, with the limit at infinity being the original proposition in the question.

  • Important comment (see the "comment *" marker in the code):
    One might notice that half_dataset_np = np.concatenate(half_dataset) actually allocates 50% of the dataset, with the other 50% still allocated in shards, apparently saving us nothing. That is correct, and I could not find a way to concat into a buffer (but see the sketch below). However, implementing this recursively as suggested (and not shown in the pseudocode) will save memory, as a quarter will only ever use 2 × 25% at a time. This is just an implementation detail, but I hope the gist is clear.
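As a possible refinement (not part of the answer above): reasonably recent NumPy versions accept an out= argument to np.concatenate, which can be used to concatenate each shard directly into its slice of the preallocated buffer, avoiding that extra temporary. A minimal sketch, assuming equally sized shards, a known (or estimated) total_rows, and reusing the question's custom read_as_numpyarray:

import glob
import numpy as np

def load_shard(buffer, shard_path, shard_ind, n_shards):
    shard = [read_as_numpyarray(f) for f in glob.glob(f'{shard_path}/*.png')]
    rows = buffer.shape[0] // n_shards                    # assumes equally sized shards
    out_slice = buffer[rows * shard_ind : rows * (shard_ind + 1), ...]
    np.concatenate(shard, axis=0, out=out_slice)          # write straight into the buffer

buffer = np.zeros((total_rows, 1000))                     # total_rows assumed known or estimated
shard_paths = ["quarter1", "quarter2", "quarter3", "quarter4"]   # hypothetical shard folders
for i, path in enumerate(shard_paths):
    load_shard(buffer, path, i, n_shards=len(shard_paths))

With out=, the only extra memory held at any time is one shard's list of small arrays, and the same loop generalizes the halves above to quarters, eighths, and so on.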

On a different note, another approach would ask: what if the dataset is 1000 GB? Then no in-memory NumPy array will do. This is why databases exist, and they can be queried quite efficiently with the right tools. But again, this is somewhat of a research question and depends heavily on your specific needs. As a very uninformed hunch, I would check out dask.

Such libraries obviously tackle problems like this one as a subset of what they do, and I recommend not implementing these things yourself, as the total time you will spend will far outweigh the time spent choosing and learning a library.
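For illustration only (not from the original answer), a minimal dask sketch of this out-of-core approach. It assumes each file's row count n can be obtained cheaply up front via a hypothetical row_count() helper (e.g. by reading the PNG header), an assumed float32 dtype, and the question's read_as_numpyarray:

import glob
import dask
import dask.array as da
import numpy as np

files = sorted(glob.glob('*.png'))

# one lazy (delayed) chunk per file; nothing is read until computation is requested
chunks = [
    da.from_delayed(dask.delayed(read_as_numpyarray)(f),
                    shape=(row_count(f), 1000),   # row_count() is hypothetical
                    dtype=np.float32)             # dtype assumed
    for f in files
]
dataset = da.concatenate(chunks, axis=0)   # still lazy, no 10 GB allocation here
result = dataset.mean(axis=0).compute()    # work is done chunk by chunk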


On yet another note, I wonder whether this really has to be such a huge array; maybe a slightly different design or formulation of the problem could spare us this technical issue altogether.

  • Let's say the total size of data is 20 GB @Gulzar. Then when running the line `dataset_np = np.concatenate(dataset)`, the system will have to still keep 20 GB (for each individual array) *and also* allocate 20 GB for the new array `dataset_np`, thus requiring 40 GB of RAM. Let's say we have only 32 GB of RAM. How would you do that? (i.e. without needing the double of RAM than the size of dataset) – Basj Jun 30 '21 at 14:54
  • @Basj Please see edit. If you think something is still missing, please say so and I will come back to this. – Gulzar Jun 30 '21 at 15:46

Pre-allocate your array. Loop through the files up front and add up the amount of data you'll retrieve from each. Then create your dataset array with the total size you'll need. Finally, loop through the files again, inserting the data into the already-allocated array. This is more efficient than allocating a new array and copying the data from all previous files over and over for each additional file.
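A minimal sketch of this two-pass approach, assuming the files are PNGs whose height corresponds to n and can be read from the header without decoding pixel data (here via Pillow), an assumed float32 dtype, and the question's custom read_as_numpyarray:

import glob
import numpy as np
from PIL import Image

files = sorted(glob.glob('*.png'))

# pass 1: read only the PNG headers to learn each file's row count
rows_per_file = [Image.open(f).size[1] for f in files]   # size = (width, height)
total_rows = sum(rows_per_file)

dataset = np.empty((total_rows, 1000), dtype=np.float32)  # dtype assumed

# pass 2: read the pixel data and write each file into its slice
offset = 0
for f, n in zip(files, rows_per_file):
    dataset[offset:offset + n] = read_as_numpyarray(f)    # shape (n, 1000)
    offset += n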

Alternatively don't build a 10GB array. Consider modifying your operations so that they are compatible on smaller chunks of data and read in more manageable data sets on demand.

  • 2
    Not an answer - size is not known in advance. – Gulzar Jun 24 '21 at 16:30
  • 2
    @Gulzar That's why step 1 is to calculate the size. Nothing in the question precludes being able to do this. – Woodford Jun 24 '21 at 16:36
  • 1
    Looping twice over 10GB... I don't consider this viable. You can't always loop only through metadata, for example if the data comes as a stream. Moreover, if it were possible to get only metadata, this wouldn't be a limitation in the question. – Gulzar Jun 24 '21 at 16:39
  • @Gulzar Opening a file and determining how much data is contained within does not require accessing each byte. The data are .png files as stated in the question. – Woodford Jun 24 '21 at 16:42
  • Opening 1k+ files is very time consuming even without reading the data, as it goes through the OS, and the files are often not on a local hard drive. I am not convinced this will not much slower than it has to be. – Gulzar Jun 24 '21 at 16:48
  • Even when the final array size is known, the list append approach of the other answer is competative with this assignment to a preallocated array. – hpaulj Jun 24 '21 at 20:59

I think you need the concept of mmap here; NumPy has built-in support for it via numpy.memmap. I like the way @Gulzar suggested sharding to process your files in chunks. I have not worked with mmap myself, but from the documentation it seems you do not need to keep the whole file in memory: you can write to it in an append-like fashion on disk and manipulate it from there. Also, if you are concerned about the time this takes (sharding keeps increasing the processing time), you could consider a distributed-computation architecture, for example ten machines that each have a smaller memory but append their chunks to the big array. I know this is not a complete answer, which is why I provide a few relevant resources at the end. There are also HDF5 and Zarr, which you can use as a workaround for this process.
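As an illustration of the HDF5 route (my sketch, not the answer author's code): with h5py you can create a dataset that is resizable along the first axis and append each file's rows to it on disk, so the full 10 GB never has to sit in RAM at once. It assumes h5py is installed, a float32 dtype, and the question's custom read_as_numpyarray:

import glob
import h5py
import numpy as np

with h5py.File('dataset.h5', 'w') as f:
    # resizable along the first axis; chunked storage so appends are cheap
    dset = f.create_dataset('data', shape=(0, 1000), maxshape=(None, 1000),
                            dtype='float32', chunks=(256, 1000))
    for path in glob.glob('*.png'):
        x = read_as_numpyarray(path)          # shape (n, 1000)
        n = x.shape[0]
        old = dset.shape[0]
        dset.resize(old + n, axis=0)
        dset[old:old + n] = x                 # only this chunk is in RAM

# later: read it back lazily, or all at once if it fits
with h5py.File('dataset.h5', 'r') as f:
    dataset = f['data'][:]                    # or slice as needed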

Algorithmic idea of what I described:
(1) Start.
(2) Parse files 1 to 10: read them and store them as an mmap-referred object on the hard disk.
(3) For each subsequent batch n, parse files n*10+1 to (n+1)*10: read them into memory and append them to the mmap-referred object on the hard disk.
(4) You end up with your NumPy array stored on disk, accessible via mmap, without ever holding all of it in memory.
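A minimal sketch of that idea (my own illustration, not the answer author's code): append each file's rows to a plain binary file on disk, then memory-map the result with numpy.memmap at the end. It assumes a float32 dtype and the question's custom read_as_numpyarray:

import glob
import numpy as np

total_rows = 0
with open('dataset.bin', 'wb') as out:
    for path in glob.glob('*.png'):
        x = read_as_numpyarray(path).astype(np.float32)  # shape (n, 1000), dtype assumed
        out.write(x.tobytes())                           # append this chunk to disk
        total_rows += x.shape[0]

# memory-map the result: behaves like a (total_rows, 1000) array,
# but pages are loaded from disk on demand instead of filling RAM
dataset = np.memmap('dataset.bin', dtype=np.float32, mode='r',
                    shape=(total_rows, 1000))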
References:
(1) mmap vs zarr vs hdf5
(2) numpy mmap support
(3) what is mmap actually
