
I have a Python code whose output is a large matrix whose entries are all of type float. If I save it with the extension .dat, the file size is on the order of 500 MB. I read that using h5py reduces the file size considerably. So, let's say I have a 2D NumPy array named A. How do I save it to an h5py file? Also, how do I read the same file back into a NumPy array in a different script, since I need to do manipulations with the array?

Martin Thoma
lovespeed
    How are you saving it with the `.dat` extension? – jorgeca Jan 05 '14 at 00:11
  • @jorgeca: for that I just do `np.savetxt("output.dat",A,'%10.8e')` – lovespeed Jan 05 '14 at 01:22
  • Thanks (the extension alone doesn't mean much, it could be stored as binary, ascii...). Unless you need the extra features of hdf5, I'd just use `np.save('output.dat', A)` which will save it in a binary format (much faster, much less space used). – jorgeca Jan 05 '14 at 01:52
  • @jorgeca but will another python script be able to read it as a 2D array when I call it as `A = np.loadtxt('output.dat',unpack=True)` – lovespeed Jan 05 '14 at 01:57
  • [Of course](http://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html#numpy.load), just drop the `txt` and the unpack argument. – jorgeca Jan 05 '14 at 02:44
  • so `h5py` doesn't create files smaller than those `np.save` would? is `h5py` faster than `np.save` for arrays of the size given in the question? – abcd Apr 13 '15 at 23:48
  • @dbliss I doubt that h5py is faster. It either writes data out [uncompressed or gzipped](http://docs.h5py.org/en/latest/high/dataset.html#lossless-compression-filters) which is pretty standard. It just offers more comfort (attributes, slices, hierachies, links, ...). – NoDataDumpNoContribution Dec 21 '15 at 22:33
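The gzip filter mentioned in the comment above is enabled per dataset. A minimal sketch of writing and reading a compressed dataset (the file and dataset names here are illustrative, not from the question):

```python
import numpy as np
import h5py

A = np.random.random(size=(200, 100))

# Write the array with the gzip filter; compression_opts is the
# gzip level 0-9 (4 is h5py's default when omitted).
with h5py.File('compressed.h5', 'w') as f:
    f.create_dataset('A', data=A, compression='gzip', compression_opts=4)

# Reading is unchanged: decompression happens transparently.
with h5py.File('compressed.h5', 'r') as f:
    B = f['A'][:]
```

How much the file shrinks depends entirely on the data; random floats like these compress poorly, while smooth or repetitive data can compress a lot.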

2 Answers


h5py provides a model of datasets and groups. Datasets are essentially arrays, and groups can be thought of as directories; each is referenced by name. You should look at the documentation for the API and examples:

http://docs.h5py.org/en/latest/quick.html

A simple example where you are creating all of the data upfront and just want to save it to an hdf5 file would look something like:

In [1]: import numpy as np
In [2]: import h5py
In [3]: a = np.random.random(size=(100,20))
In [4]: h5f = h5py.File('data.h5', 'w')
In [5]: h5f.create_dataset('dataset_1', data=a)
Out[5]: <HDF5 dataset "dataset_1": shape (100, 20), type "<f8">

In [6]: h5f.close()

You can then load that data back in using:

In [10]: h5f = h5py.File('data.h5','r')
In [11]: b = h5f['dataset_1'][:]
In [12]: h5f.close()

In [13]: np.allclose(a,b)
Out[13]: True

Definitely check out the docs:

http://docs.h5py.org

Writing to an HDF5 file requires either h5py or PyTables (each has a different Python API that sits on top of the HDF5 file specification). You should also take a look at the simple binary formats provided natively by NumPy, such as np.save, np.savez, etc.:

http://docs.scipy.org/doc/numpy/reference/routines.io.html
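For comparison, the native NumPy binary format is a one-liner each way. A minimal sketch (the filename is illustrative):

```python
import numpy as np

A = np.random.random(size=(100, 20))

# np.save writes a binary .npy file (the extension is appended if
# it's missing); far smaller and faster than np.savetxt's text output.
np.save('output.npy', A)

# np.load returns the array with shape and dtype preserved,
# so no unpack/reshape step is needed.
B = np.load('output.npy')
```

This is the simplest option when you only need to move a single array between scripts and don't need HDF5's groups, attributes, or partial reads.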

gkcn
JoshAdel
  • Btw. if you don't know the name of the dataset beforehand while reading you have to parse the hdf file similar to [here](http://stackoverflow.com/questions/34330283/how-to-differentiate-between-hdf5-datasets-and-groups-with-h5py). – NoDataDumpNoContribution Dec 21 '15 at 22:35
  • @JoshAdel if I want to add a column to the dataset. my dataset is a multidimensional np.array indexed as [img_id,rows,colums,channels]. and I have saved it using the method described in your answer. I access all the points in the dataset using h5f['dataset_1'][img_id]. what I want is a way to add another column say 'mycolumn' ...corresponding to every img_id in dataset. how should I add another column to this so I can do h5f['mycolumn'][img_id] ? – Irtaza May 06 '16 at 13:55
  • If I write matrices like this, then I cannot see them with HDFView 2.11 - I can open the file, I can see that the dataset `data.h5` exists, but I cannot view it with HDFView. I can read the contents with h5py, but not inspect it with HDFView. Any idea why? – Martin Thoma May 03 '19 at 09:41
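On the first comment above: if the dataset names are not known in advance, the file can be inspected before reading. A minimal self-contained sketch (it first creates a sample file mirroring the answer's `data.h5`):

```python
import numpy as np
import h5py

# Create a sample file like the one in the answer above.
with h5py.File('data.h5', 'w') as h5f:
    h5f.create_dataset('dataset_1', data=np.random.random(size=(100, 20)))

# Discover the contents without knowing any names beforehand.
with h5py.File('data.h5', 'r') as h5f:
    names = list(h5f.keys())  # top-level names (datasets and groups)
    # Keep only actual datasets; groups would need recursive descent.
    arrays = {name: h5f[name][:] for name in names
              if isinstance(h5f[name], h5py.Dataset)}
```

For nested groups, `h5f.visititems(...)` walks the whole hierarchy and passes each (name, object) pair to a callback, which is the approach taken in the linked question.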

A cleaner way to handle file open/close and avoid memory leaks:

Prep:

import numpy as np
import h5py

data_to_write = np.random.random(size=(100,20)) # or some such

Write:

with h5py.File('name-of-file.h5', 'w') as hf:
    hf.create_dataset("name-of-dataset",  data=data_to_write)

Read:

with h5py.File('name-of-file.h5', 'r') as hf:
    data = hf['name-of-dataset'][:]
Lavi Avigdor
  • No need to close file? – daviddesancho Apr 05 '17 at 10:52
  • @DrDeSancho no, [the with statement](https://docs.python.org/2/reference/compound_stmts.html#the-with-statement) – Leonid Apr 06 '17 at 10:51
  • especially useful when running in interactive mode (because otherwise one risks to get an exception from h5py about an already open file when one reruns the same code without properly closing in the first attempt) – Andre Holzner Sep 21 '17 at 08:11
  • The `with` feature of Python is known as the context manager. It will make sure the file is closed after it has been used. More information is available in the official documentation: https://docs.python.org/3/library/contextlib.html – moo Jan 29 '20 at 18:50
  • To read scalar values, use `hf['name-of-scalar'][()]`, or you will get a `ValueError: Illegal slicing argument for scalar dataspace`. – MrCrHaM Jun 05 '23 at 12:18
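The scalar case from the last comment, as a runnable sketch (the file and dataset names are illustrative):

```python
import h5py

# A plain Python number becomes a 0-dimensional (scalar) dataset.
with h5py.File('scalar.h5', 'w') as hf:
    hf.create_dataset('name-of-scalar', data=3.14)

with h5py.File('scalar.h5', 'r') as hf:
    # [()] reads the whole dataset; [:] would raise
    # "ValueError: Illegal slicing argument for scalar dataspace".
    value = hf['name-of-scalar'][()]
```

`[()]` also works on regular array datasets (it reads the entire array), so it is a safe default when a dataset's shape is unknown.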