How to combine two huge numpy arrays without concat, stack, or append?

Question

I have two huge numpy arrays, each with shape (7, 960000, 200). I want to concatenate them using np.concatenate((arr1, arr2), axis=1) so that the final shape is (7, 1920000, 200). The problem is that they already fill up my RAM, and there is not enough room left to do the concatenation, so the process gets killed. The same happens with np.stack. So I thought of making a new array that points to the two arrays in order; this new array should behave as if the arrays were combined, and it should be contiguous as well.

So, how can I do this? And is there a better way to combine them than the idea I suggested?

Does this answer your question? [Concatenate Numpy arrays without copying](https://stackoverflow.com/questions/7869095/concatenate-numpy-arrays-without-copying) — ti7, May 24 '22 at 17:38
This isn't really possible. Arrays are stored in single contiguous blocks of memory, and you would have to define a whole new class if you wanted to perform operations on a list of two arrays (and it would defeat the purpose of an array to be indexed very efficiently). Like in the question linked in the other comment, preallocating is the best solution if possible. — Zorgoth, May 24 '22 at 17:38
`np.stack` uses `concatenate`; it just tweaks the dimensions. Same for the other `stacks`. An array that 'points' to other arrays must be `object` dtype, and is essentially a list. They won't be contiguous. – hpaulj, May 24 '22 at 17:39
You mention one solution. The only alternative is to rely on virtual memory, for example by memory-mapping the array to a storage device so that it no longer has to fit in RAM. Note that this can be much slower than working in RAM, especially for non-contiguous accesses or on HDDs. – Jérôme Richard, May 24 '22 at 18:35

Hersh Joshi · Accepted Answer · 2022-05-27T22:10:18.817


NumPy's numpy.memmap() creates memory-mapped data stored as a binary file on disk that can be accessed and used as if it were a single in-memory array. This solution saves the individual arrays you are working with as separate .npy files and then combines them into a single binary file.
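For readers new to memmap, here is a minimal sketch of how such a file behaves, on a tiny array. It is not part of the original answer, and the scratch file name `demo.dat` is a placeholder of my own. It shows that the raw file carries no header, so the dtype and shape have to be supplied again when reopening it. The answer's full code follows after it.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")

# mode="w+" creates (or overwrites) the file; shape is required in this mode.
mm = np.memmap(path, dtype=np.float64, mode="w+", shape=(4, 3))
mm[:2] = 1.0   # assignments are written through to the mapped file
mm[2:] = 2.0
mm.flush()     # make sure pending changes reach the file
del mm         # drop the mapping

# The raw file has no header, so dtype and shape must be given again.
ro = np.memmap(path, dtype=np.float64, mode="r", shape=(4, 3))
total = float(ro.sum())  # ordinary NumPy operations work on the mapping
```

Only the pages that are actually touched need to be resident in RAM, which is why this approach can handle arrays larger than memory.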

import numpy as np
import os

size = (7,960000,200)

# We are assuming arrays a and b share the same shape, if they do not 
# see https://stackoverflow.com/questions/50746704/how-to-merge-very-large-numpy-arrays
# for an explanation on how to create the new shape

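a = np.ones(size) # float64: 7 * 960000 * 200 * 8 bytes, about 10.75 GB of RAM
a = np.transpose(a, (1,0,2)) # view with shape (960000, 7, 200)
shape = (2 * a.shape[0],) + a.shape[1:] # a.shape is a tuple, so build the doubled shape rather than assigning in place
dtype = a.dtype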

np.save('a.npy', a)
a = None # allows for data to be deallocated by garbage collector

b = np.ones(size) # another ~10.75 GB of RAM as float64
b = np.transpose(b, (1,0,2))
np.save('b.npy', b) # save b here; a has already been set to None
b = None

# Once the size is known, create the memmap and write the chunks
data_files = ['a.npy', 'b.npy']
merged = np.memmap('merged.dat', dtype=dtype, mode='w+', shape=shape)
i = 0
for file in data_files:
    chunk = np.load(file, allow_pickle=True)
    merged[i:i+len(chunk)] = chunk
    i += len(chunk)

merged = np.transpose(merged, (1,0,2))

# Delete temporary numpy .npy files
os.remove('a.npy')
os.remove('b.npy')
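One thing to watch in the loop above is that `np.load(file)` reads each saved array fully into RAM before copying it. A possible variation, sketched here on tiny stand-in arrays, opens each .npy file with `mmap_mode="r"` and copies it in row blocks. The helper name `append_npy`, the `block_rows` parameter, and the temp-dir paths are my own assumptions, not part of the answer.

```python
import os
import tempfile

import numpy as np


def append_npy(dst, src_path, start, block_rows=1024):
    """Copy the .npy file at src_path into dst[start:...] in row blocks.

    Returns the index just past the last row written.
    """
    src = np.load(src_path, mmap_mode="r")  # maps the file instead of reading it all
    for lo in range(0, src.shape[0], block_rows):
        hi = min(lo + block_rows, src.shape[0])
        dst[start + lo:start + hi] = src[lo:hi]
    return start + src.shape[0]


# Tiny stand-ins for the transposed arrays (long axis first).
tmp = tempfile.mkdtemp()
a = np.arange(24, dtype=np.float64).reshape(4, 3, 2)
b = -a
paths = [os.path.join(tmp, "a.npy"), os.path.join(tmp, "b.npy")]
np.save(paths[0], a)
np.save(paths[1], b)

merged = np.memmap(os.path.join(tmp, "merged.dat"), dtype=a.dtype,
                   mode="w+", shape=(8, 3, 2))
pos = 0
for p in paths:
    pos = append_npy(merged, p, pos, block_rows=3)
merged.flush()
```

With the real data, `block_rows` would be chosen so that one block fits comfortably in memory; the value used here is only for illustration.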

Based on this Stack Overflow answer.
Also check out HDF5 and combining two HDF5 files here. It's another good way of storing large datasets.

edited May 27 '22 at 22:10

answered May 24 '22 at 18:49


Thanks for your reply. I like the idea. I tried it but I have some questions/problems with it. For the questions: 1) Arrays a and b are the arrays I want to store, right? If so, I don't need to create new arrays or give them a size, since my arrays are already there and have my values? Do I get that right? 2) The np.memmap initializes the memory map, but it is empty until the loop fills it with the values from the npy files, right? – Peter Ragheb May 24 '22 at 22:43

Yeah, just use arr1 instead of a and arr2 instead of b. Also, yeah, np.memmap() isn't usable until you fill it with the contents of arr1 and arr2, but afterwards it's basically a massive array with the contents of arr1 and arr2 inside of it – Hersh Joshi May 24 '22 at 22:50

I just created two 16 GB arrays a and b to test it out on my system, you wouldn't use them or set the size yourself since you already have your arrays – Hersh Joshi May 24 '22 at 22:51
For the problems: Both arrays have the shape of (7,960000,200), and it was mentioned that the shape of the memmap should be the final shape. The shape required is (7,1920000,200) which is the shape you would get when concatenating them on axis=1. But you have put the shape of 1 of them, not the final. Regardless of which shape I choose, I get errors. If I choose the shape of one of them, I get "could not broadcast input array from shape (7,960000,200) into shape (0,960000,200)", and when using final shape "could not broadcast input array from shape (7,960000,200) into shape (7,1920000,200)" – Peter Ragheb May 24 '22 at 22:53
And 2 final questions. Will it combine both arrays as if I concatenated them? And is the merged variable now the one I use when manipulating the array? And can I use any np operations (using predefined methods) on it as if it were loaded into memory? – Peter Ragheb May 24 '22 at 22:58
Use `np.transpose(arr1, (1,0,2))` and `np.transpose(arr2, (1,0,2))` at the start of the code to swap the axes, and then you'll be able to concatenate them along the first axis. I also forgot to add the line that doubles `shape[0]`. Once that is done, if you need merged in the original shape you can use `np.transpose(merged, (1,0,2))`; this transpose will be a very costly operation however, and I'd avoid the final transpose, as a shape of (1920000,7,200) makes more sense – Hersh Joshi May 25 '22 at 16:04
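If the (7, 1920000, 200) layout is really needed, one alternative to the transposes is to create the memmap with the final shape and assign each array into its half of axis 1. This is a sketch on small stand-in arrays, assuming both inputs share a shape and dtype; the file name `merged_axis1.dat` is a placeholder of my own. On this small scale it gives the same result as `np.concatenate(..., axis=1)`.

```python
import os
import tempfile

import numpy as np

# Small stand-ins for arr1 and arr2; the real shape is (7, 960000, 200).
arr1 = np.random.rand(7, 5, 4)
arr2 = np.random.rand(7, 5, 4)
n = arr1.shape[1]

# Hypothetical output path for the merged data.
out_path = os.path.join(tempfile.mkdtemp(), "merged_axis1.dat")
merged = np.memmap(out_path, dtype=arr1.dtype, mode="w+",
                   shape=(arr1.shape[0], 2 * n, arr1.shape[2]))

merged[:, :n, :] = arr1   # first half along axis 1
merged[:, n:, :] = arr2   # second half along axis 1
merged.flush()

# Sanity check, only feasible at this small scale.
same = np.array_equal(merged, np.concatenate((arr1, arr2), axis=1))
```

Each assignment writes 7 separate contiguous runs in the file, so it may be slower than appending along the first axis, but subsequent reads see the original axis order without a transposed view.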
