
In a 64-bit Linux environment, I have very big float64 arrays (a single one will be 500 GB to 1 TB). I would like to access these arrays in numpy in a uniform way: a[x:y]. That is, I do not want to access the array as segments, file by file. Are there any tools that let me create a memmap over many different files? Can hdf5 or pytables store a single CArray across many small files? Maybe something similar to fileinput? Or can I do something with the file system to simulate a single file?

In MATLAB I've been using H5P.set_external to do this: I can create a raw dataset and access it as one big raw file. But I do not know whether I can create a numpy.ndarray over such a dataset in Python. Or can I spread a single dataset over many small hdf5 files?

Unfortunately, H5P.set_chunk does not work with H5P.set_external, because set_external only works with the contiguous layout, not the chunked layout.
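Roughly, this is what I'm hoping is possible from Python. The sketch below assumes a recent h5py that exposes H5Pset_external through an external= keyword of create_dataset; the file names, segment size, and dataset name are placeholders:

```python
import h5py

# Placeholder layout: each existing raw segment file holds n_per_file float64 values.
n_per_file = 100_000_000          # elements per segment (placeholder)
seg_bytes = n_per_file * 8        # bytes per segment (float64)
files = ["seg0.bin", "seg1.bin", "seg2.bin", "seg3.bin"]

with h5py.File("big.h5", "w") as f:
    # external= takes (filename, offset, size) tuples, the h5py analogue of H5P.set_external.
    # External storage requires the contiguous layout, so no chunks= here.
    f.create_dataset(
        "data",
        shape=(n_per_file * len(files),),
        dtype="float64",
        external=[(fn, 0, seg_bytes) for fn in files],
    )

with h5py.File("big.h5", "r") as f:
    a = f["data"]
    piece = a[1_000_000:2_000_000]   # slicing returns a numpy.ndarray read from the segment files
```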

Some related topics: Chain datasets from multiple HDF5 files/datasets

Wang

2 Answers


I would use hdf5. In h5py, you can specify a chunk size, which makes retrieving small pieces of the array efficient:

http://docs.h5py.org/en/latest/high/dataset.html?#chunked-storage
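A minimal sketch of chunked storage (the file name, dataset name, overall size, and chunk shape are arbitrary choices here):

```python
import numpy as np
import h5py

with h5py.File("array.h5", "w") as f:
    # HDF5 stores the data in fixed-size chunks; a slice only touches the chunks it overlaps.
    dset = f.create_dataset(
        "data",
        shape=(10**9,),           # logical array size
        dtype="float64",
        chunks=(1024 * 1024,),    # 8 MB chunks
    )
    dset[: 1024 * 1024] = np.random.rand(1024 * 1024)   # write one chunk's worth of data

with h5py.File("array.h5", "r") as f:
    piece = f["data"][5_000_000:5_000_100]   # out-of-core read, returned as a numpy array
```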

JoshAdel
  • Isn't the link about several datasets in a single hdf5 file? I would like to be able to view the datasets in many different files as a single one. Can you point me to where I can find this info? – Wang Sep 16 '16 at 15:05
  • I'm suggesting putting the entire array into a single HDF5 file. HDF5 will then handle chunking the data into small blocks of disk space so you can efficiently access the array out-of-core. Unless you are somehow constrained that the system producing the array is already writing to many small files. – JoshAdel Sep 16 '16 at 15:18
  • Well, one big file is not an option. It needs to be in many small files. I would like to avoid copying TB-level data. – Wang Sep 16 '16 at 15:21

You can use dask. Dask arrays allow you to create an object that behaves like a single big numpy array but represents data stored in many small HDF5 files. Dask will take care of figuring out how any operations you carry out relate to the underlying on-disk data.
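A minimal sketch, assuming each small file is an HDF5 file holding a 1-D float64 dataset named "data" (the file names, dataset name, and chunk size are placeholders):

```python
import dask.array as da
import h5py

files = ["part_000.h5", "part_001.h5", "part_002.h5"]   # placeholder file names
handles = [h5py.File(fn, "r") for fn in files]          # keep the files open while the array is in use

# Wrap each on-disk dataset lazily, then concatenate them into one logical array.
pieces = [da.from_array(h["data"], chunks=(1024 * 1024,)) for h in handles]
big = da.concatenate(pieces, axis=0)

segment = big[10_000:20_000].compute()   # numpy array; dask only reads the files/chunks it needs
```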

TheBlackCat
  • While this is a good suggestion at the outset, I was not convinced to pursue it due to a rather important limitation: dask arrays cannot be indexed by other arrays. See http://dask.pydata.org/en/latest/array-overview.html#limitations: "Dask.array does not support any operation where the resulting shape depends on the values of the array. In order to form the dask graph we must be able to infer the shape of the array before actually executing the operation. This precludes operations like indexing one dask array with another or operations like np.where." – achennu Sep 17 '16 at 13:23