
I will have many NumPy arrays stored in npz files, which are saved using the savez_compressed function.

I am splitting the information across many arrays because otherwise the functions I am using crash due to memory issues. The data is not sparse.

I will need to join all that information into one single array (to be able to process it with some routines), and store it on disk (to process it many times with different parameters).

The arrays won't fit into RAM + swap memory.

How can I merge them into a single array and save it to disk?

I suspect that I should use mmap_mode, but I do not see exactly how. Also, I imagine there could be performance issues if I do not reserve contiguous disk space first.

I have read this post but I still cannot figure out how to do it.


EDIT

Clarification: I have written many functions to process similar data, and some of them require an array as an argument. In some cases I could pass them only part of this large array by using slicing, but it is still important to have all the information in one such array.

This is because of the following: the arrays contain time-ordered information (from physical simulations). Among the arguments of the functions, the user can set the initial and final times to process, as well as the size of the processing chunk (which is important because it affects performance, but the allowed chunk size depends on the computational resources). Because of this, I cannot store the data as separate chunks.

The way in which this particular array (the one I am trying to create) is built is not important, as long as it works.
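
For example, the kind of access a processing routine needs is roughly the following (just a sketch; the file name, time axis, and numbers are placeholders for my real data):

import numpy as np

# Illustrative only: roughly how a processing routine consumes the merged array.
# 'merged.npy', the time stamps, t0, t1 and chunk_size are all placeholders.
big_array = np.load('merged.npy', mmap_mode='r')
times = np.linspace(0.0, 1.0, big_array.shape[0])   # stand-in for the real time axis

t0, t1, chunk_size = 0.2, 0.8, 100                  # chosen by the user
i0, i1 = np.searchsorted(times, [t0, t1])

for i in range(i0, i1, chunk_size):
    block = big_array[i:min(i + chunk_size, i1)]    # only this slice is read from disk
    # ... run the processing routine on `block` ...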

user1420303
  • You can't mmap compressed arrays. I think the current `np.load` implementation just ignores `mmap_mode` if you try. – user2357112 Jun 07 '18 at 17:09
  • Thank you for the information. – user1420303 Jun 07 '18 at 17:15
  • Must you merge them into a single array, or can you just load them chunk by chunk, process them chunk by chunk, and write them out chunk by chunk? – Linuxios Jun 07 '18 at 17:16
  • @Linuxios, Thank you for your reply. I edited the question in order to answer yours. – user1420303 Jun 07 '18 at 17:32
  • @user1420303: I see. You can still use chunked data by looking at the time range, finding the corresponding chunks, loading those, and slicing the first and last chunks if necessary. It's a little more logic, but it prevents you from running out of memory. You could even abstract it into some kind of streaming collection class allowing transparent array indexing and hiding that logic (a rough sketch of this idea follows after these comments). – Linuxios Jun 07 '18 at 17:36
  • @Linuxios, I bet it could. But why not just save it to an uncompressed binary and load it partially as required using memmap? (as described in https://stackoverflow.com/questions/42727412/efficient-way-to-partially-read-large-numpy-file/42727761 ) – user1420303 Jun 07 '18 at 17:50
  • @user1420303: I was unaware of that feature. That seems reasonable, yes. I'm not entirely sure how it can be done, however. – Linuxios Jun 07 '18 at 17:52
  • Memmap is very limited (only the fastest-changing dimension can be loaded efficiently; all others are extremely slow). Something more general would be to use HDF5 (h5py) to save/load the data (depending on the chunk shape, every dimension can be loaded efficiently, and compression can also be used). To give a proper answer more information is needed (array sizes, how to merge them, is the data compressible, in which chunks (shape) do you want to process the data, ...), e.g. something like this to store the data: https://stackoverflow.com/a/48997927/4045774 – max9111 Jun 07 '18 at 18:53
  • @max9111, Thanks for your answer. The array size is case dependent. It is a 2D array; one index can vary from a few tens to some thousands, the other from about 50E3 to 1E7. The chunk size for reading is PC dependent, from a few tens to a few hundred. "How to merge them" is exactly my problem. – user1420303 Jun 07 '18 at 19:44
  • Precise information on these things is really important. Something like: my input arrays are of shape (10, 1E3-1E7), my final array should have shape (50-500, 1E3-1E7). It would be good if you added a realistic example to your question. Reading speeds can go from a few hundred KB/s to 100-1000 MB/s depending on the write/read pattern and the compression ratio. – max9111 Jun 07 '18 at 20:05
  • @max9111, I do not neglect the importance, but I cannot give precise info because it is very case dependent. That is why I gave those wide margins. Great performance for the merging process is not too important (I can leave the PC working for 1 or 2 days). In my actual case I have approximately (15E5, 15E3), divided into maybe 5 or 6 files. But it should work for other values too. – user1420303 Jun 07 '18 at 20:15
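
As a rough illustration of the streaming collection class suggested by Linuxios in the comments above, here is a minimal sketch. It assumes each npz file holds one time-ordered 2D array under the key 'array' and that the files are given in time order; the class and variable names are made up:

import numpy as np

class ChunkedTimeSeries:
    """Sketch: present several npz chunks as one time-ordered 2D array,
    loading only the files a requested row range actually needs."""

    def __init__(self, data_files, key='array'):
        self.data_files = data_files
        self.key = key
        self.offsets = [0]                       # cumulative row counts per file
        for fname in data_files:
            with np.load(fname) as data:
                n_rows, self.cols = data[key].shape
                self.offsets.append(self.offsets[-1] + n_rows)

    @property
    def shape(self):
        return (self.offsets[-1], self.cols)

    def rows(self, start, stop):
        """Return rows start:stop as one in-memory array."""
        parts = []
        for i, fname in enumerate(self.data_files):
            lo, hi = self.offsets[i], self.offsets[i + 1]
            if hi <= start or lo >= stop:
                continue                         # this file is outside the requested range
            with np.load(fname) as data:
                parts.append(data[self.key][max(start - lo, 0):stop - lo])
        return np.concatenate(parts, axis=0)

# Usage (file names are made up):
# series = ChunkedTimeSeries(['file1.npz', 'file2.npz'])
# block = series.rows(1000, 2000)   # loads only the files covering rows 1000-2000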

2 Answers


This is an example of how to write 90 GB of easily compressible data to disk. The most important points are mentioned here: https://stackoverflow.com/a/48405220/4045774

The write/read speed should be in the range of 300-500 MB/s on a normal HDD.

Example

import numpy as np
import tables  # registers the blosc filter
import h5py as h5
import h5py_cache as h5c
import time

def read_the_arrays():
  # Easily compressible data
  # A lot smaller than your actual array; I do not have that much RAM
  return np.arange(10*int(15E3)).reshape(10, int(15E3))

def writing(hdf5_path):
  # As we are writing whole chunks here, a large chunk cache isn't really needed.
  # If you forget to set a large enough chunk-cache-size when not writing or reading
  # whole chunks, the performance will be extremely bad (chunks can only be read or
  # written as a whole).
  f = h5c.File(hdf5_path, 'w', chunk_cache_mem_size=1024**2*1000)  # 1000 MB cache size
  dset = f.create_dataset("your_data", shape=(int(15E5), int(15E3)), dtype=np.float32,
                          chunks=(10000, 100), compression=32001,
                          compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)

  # Let's write to the dataset, 10 rows at a time
  for i in range(0, int(15E5), 10):
    dset[i:i+10, :] = read_the_arrays()

  f.close()

def reading(hdf5_path):
  f = h5c.File(hdf5_path, 'r', chunk_cache_mem_size=1024**2*1000)  # 1000 MB cache size
  dset = f["your_data"]

  # Read the data back in column blocks
  for i in range(0, int(15E3), 10):
    data = np.copy(dset[:, i:i+10])
  f.close()

hdf5_path='Test.h5'
t1=time.time()
writing(hdf5_path)
print(time.time()-t1)
t1=time.time()
reading(hdf5_path)
print(time.time()-t1)
max9111
  • Thank you. The write speed is just fine. I need to think about the code. I am not familiar with hdf5. Q: You do 'dset = f.create_dataset' and then 'dset[i:i+10,:]=read_the_arrays()' many times. The whole array is never in RAM, right? – user1420303 Jun 08 '18 at 14:54
  • Yes, the read_the_arrays() function should simply imitate the reading process from your npz files. So the max. RAM usage should be the size of one input array plus the chunk-cache-size, which I have set to 1000 MB. This can also be lower, but if you have too little cache, the performance will decrease drastically. – max9111 Jun 08 '18 at 15:07
  • Nice, that makes it simple for me to solve another problem (some time values are repeated across arrays, that is, there is a little overlap). Do you think that the final .h5 file can be simply converted to .npz? – user1420303 Jun 08 '18 at 15:13
  • If repeated values occur within one chunk, they will be handled well by the compression algorithm. Converting an HDF5 file which is too big to fit in memory into a compressed numpy file is possible, but not so straightforward (it amounts to writing the data in chunks to a zip file). – max9111 Jun 08 '18 at 15:31
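
As a rough sketch of such a chunk-wise conversion, the following streams the dataset into an uncompressed .npy via a preallocated memmap (a plain .npy rather than a compressed .npz, since appending chunks to an npz is the non-trivial part mentioned in the last comment; the file and dataset names follow the example code above):

import numpy as np
import h5py

# Sketch: stream the HDF5 dataset into a preallocated, uncompressed .npy file
# chunk by chunk, so the whole array never has to fit in RAM.
with h5py.File('Test.h5', 'r') as f:
    dset = f['your_data']
    out = np.lib.format.open_memmap('your_data.npy', mode='w+',
                                    dtype=dset.dtype, shape=dset.shape)
    for i in range(0, dset.shape[0], 10000):
        out[i:i + 10000] = dset[i:i + 10000]
    out.flush()

# The result can later be opened without loading it fully:
# arr = np.load('your_data.npy', mmap_mode='r')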

You should be able to load the data chunk by chunk into a np.memmap array:

import numpy as np

data_files = ['file1.npz', 'file2.npz', ...]

# If you do not know the final size beforehand you need to
# go through the chunks once first to check their sizes
rows = 0
cols = None
dtype = None
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        rows += chunk.shape[0]
        cols = chunk.shape[1]
        dtype = chunk.dtype

# Once the size is known, create the memmap and write the chunks
merged = np.memmap('merged.buffer', dtype=dtype, mode='w+', shape=(rows, cols))
idx = 0
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        merged[idx:idx + len(chunk)] = chunk
        idx += len(chunk)

However, as pointed out in the comments, working across a dimension which is not the fastest-changing one will be very slow.
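
To illustrate that point, a small sketch (reusing dtype, rows and cols from the snippet above): for a C-ordered memmap, blocks of rows are contiguous on disk, while blocks of columns are scattered across the whole file.

import numpy as np

# Reusing dtype, rows and cols as computed above.
merged = np.memmap('merged.buffer', dtype=dtype, mode='r', shape=(rows, cols))

# Fast: each row block is a contiguous region of the file.
for i in range(0, rows, 1000):
    block = np.array(merged[i:i + 1000, :])

# Slow: each column block gathers many small pieces scattered over the file.
for j in range(0, cols, 10):
    block = np.array(merged[:, j:j + 10])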

jdehesa
  • Thank you for your answer. It gives me some ideas. I do not get how the code loads the multiple preexisting npz files. – user1420303 Jun 08 '18 at 14:44
  • @user1420303 With `data.iteritems` you go through the arrays in the file, and with `sorted(data.keys())` you go through the names in the array (I'm assuming they should be sorted alphabetically, but could be something else). – jdehesa Jun 08 '18 at 15:17
  • Right. As I understand the code, it reads 'one' npz file with multiple arrays inside, and merges them. I need to read 'many' npz files, each one containing one array, and merge them. – user1420303 Jun 08 '18 at 15:28
  • @user1420303 Ahh, I see, okay, I was not understanding correctly. I changed it now. – jdehesa Jun 08 '18 at 15:34