
I am building simulation software, and I need to write (thousands of) 2D numpy arrays into tables in an HDF5 file, where one dimension of the array is variable. The incoming array is of float32 type; to save disk space every array is stored as a table with appropriate data types for the columns (hence not using arrays). When I read tables, I'd like to retrieve a numpy.ndarray of type float32, so I can do nice calculations for analysis. Below is example code with an array containing species A, B, and C plus time.

The way I am currently reading and writing 'works', but it is very slow. The question is thus: what is the appropriate way to store arrays into tables quickly, and to read them back into ndarrays? I have been experimenting with numpy.recarray, but I cannot get it to work (type errors, dimension errors, wholly wrong numbers, etc.).

Code:

import tables as pt
import numpy as np

# Variable dimension
var_dim=100

# Example array, rows 0 and 3 should be stored as float32, rows 1 and 2 as uint16
array=(np.random.random((4, var_dim)) * 100).astype(dtype=np.float32)

filename='test.hdf5'
hdf=pt.open_file(filename=filename,mode='w')
group=hdf.create_group(hdf.root,"group")

particle={
    'A':pt.Float32Col(),
    'B':pt.UInt16Col(),
    'C':pt.UInt16Col(),
    'time':pt.Float32Col(),
    }
dtypes=[
    np.float32,
    np.uint16,
    np.uint16,
    np.float32
    ]

# This is the table to be stored in
table=hdf.create_table(group,'trajectory', description=particle, expectedrows=var_dim)

# My current way of storing
for i, row in enumerate(array.T):
    table.append([tuple([t(x) for t, x in zip(dtypes, row)])])
table.flush()
hdf.close()


hdf=pt.open_file(filename=filename,mode='r')
array_table=hdf.root.group.trajectory

# My current way of reading
row_list = []
for i, row in enumerate(array_table.read()):
    row_list.append(np.array(list(row)))

# The retrieved array
array=np.asarray(row_list).T


# I've tried something with a recarray
rec_array=array_table.read().view(type=np.recarray)

# This gives me errors, or wrong results
rec_array.view(dtype=np.float64)
hdf.close()

The error I get:

Traceback (most recent call last):
  File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 475, in __setattr__
    ret = object.__setattr__(self, attr, val)
ValueError: new type not compatible with array.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/thomas/Documents/Thesis/SO.py", line 53, in <module>
    rec_array.view(dtype=np.float64)
  File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 480, in __setattr__
    raise exctype(value)
ValueError: new type not compatible with array.
Closing remaining open files:test.hdf5...done
Patrickens
  • It might help to see the shape and dtype of `array` (from the first `asarray`). I'm guessing it's already a structured array. Or the similar info for the `recarray` version. – hpaulj Apr 26 '17 at 17:04
  • Is using tables the only solution possible for you? How do you access your data afterwards (only the whole 2D arrays, or subsets of them)? – max9111 Apr 26 '17 at 22:31
  • Are there also only a few columns (in your example only four) in your real data? Is your data compressible? Is even lossy compression a possibility for you? https://computation.llnl.gov/projects/floating-point-compression/zfp-compression-ratio-and-quality – max9111 Apr 26 '17 at 22:55

2 Answers


As a quick and dirty solution, it is possible to avoid loops by temporarily converting the arrays to lists (if you can spare the memory). For some reason record arrays are readily converted to/from lists but not to/from conventional arrays.

Storing:

table.append(array.T.tolist())

Loading:

loaded_array = np.array(array_table.read().tolist(), dtype=np.float64).T

There should be a more "Numpythonic" approach to convert between record arrays and conventional arrays, but I'm not familiar enough with the former to know how.
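A minimal, self-contained sketch of this round trip (the dtype and sizes here are made up for the demo, mirroring the columns in the question):

```python
import numpy as np

# Hypothetical structured dtype mirroring the table columns in the question
dt = np.dtype([('A', 'f4'), ('B', 'u2'), ('C', 'u2'), ('time', 'f4')])

# A plain (4, n) float32 array, as produced by the simulation
plain = (np.random.random((4, 10)) * 100).astype(np.float32)

# Plain -> structured: tolist() gives rows as lists, which must be
# converted to tuples before NumPy accepts them as records
records = np.array([tuple(row) for row in plain.T.tolist()], dtype=dt)

# Structured -> plain: tolist() gives a list of tuples, which np.array
# flattens into a 2D array of one dtype
back = np.array(records.tolist(), dtype=np.float32).T

print(back.shape)  # (4, 10); rows 1 and 2 were truncated to uint16
```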

MB-F
  • That already makes the code more readable! I do still wonder what the Numpythonic way would be. Nevertheless, thank you! – Patrickens Apr 26 '17 at 14:38
  • @Patrickens I just found out about [np.core.records.fromarrays](https://docs.scipy.org/doc/numpy/reference/generated/numpy.core.records.fromarrays.html), but I don't think it provides any benefit in this case. Internally it converts the array to a list of arrays which is less efficient than `.tolist()` and it requires more arguments. Maybe my approach is not so Un-Numpythonic after all :) – MB-F Apr 26 '17 at 15:13
  • In limited cases `view` or `astype` can be used to convert structured arrays to numeric ones, but this `tolist` intermediary is the most general means. Note that going the other way requires converting a list of lists into a list of tuples. An alternative is to copy fields by name. Since the number of records usually is large compared to the number of fields, you don't lose much by this iteration. – hpaulj Apr 26 '17 at 17:07
  • http://stackoverflow.com/questions/39502461/truly-recursive-tolist-for-numpy-structured-arrays is an example where even `tolist` is not general enough. `from numpy.lib import recfunctions` is another set of structured/recarray functions. Those favor recursively copying fields by name. – hpaulj Apr 26 '17 at 17:34

I haven't worked with tables, but have looked at its files with h5py. I'm guessing then that your array or recarray is a structured array with dtype like:

In [131]: dt=np.dtype('f4,u2,u2,f4')
In [132]: arr=np.ones(3, dtype=dt)
In [133]: arr
Out[133]: 
array([( 1., 1, 1,  1.), ( 1., 1, 1,  1.), ( 1., 1, 1,  1.)], 
      dtype=[('f0', '<f4'), ('f1', '<u2'), ('f2', '<u2'), ('f3', '<f4')])

Using @kazemakase's tolist approach (which I've recommended in other posts):

In [134]: np.array(arr.tolist(), float)
Out[134]: 
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

astype gets the shape all wrong:

In [135]: arr.astype(np.float32)
Out[135]: array([ 1.,  1.,  1.], dtype=float32)

view works when the component dtypes are uniform, for example with the two float fields:

In [136]: arr[['f0','f3']].copy().view(np.float32)
Out[136]: array([ 1.,  1.,  1.,  1.,  1.,  1.], dtype=float32)

But it does require a reshape; view just reinterprets the bytes of the data buffer.
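A runnable sketch of that selection-plus-view; note that on NumPy 1.16 and later a multi-field selection returns a view with padding, so `numpy.lib.recfunctions.repack_fields` is needed before the view (this assumes a recent NumPy):

```python
import numpy as np
from numpy.lib.recfunctions import repack_fields

dt = np.dtype('f4,u2,u2,f4')
arr = np.ones(3, dtype=dt)

# Select the two uniform float32 fields; repack to drop the padding
# left by the u2 fields, then reinterpret the buffer as plain float32
packed = repack_fields(arr[['f0', 'f3']])
two_fields = packed.view(np.float32).reshape(len(arr), 2)
print(two_fields.shape)  # (3, 2)
```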

Many recfunctions functions use a field-by-field copy. Here the equivalent would be:

In [138]: res = np.empty((3,4),'float32')
In [139]: for i in range(4):
     ...:     res[:,i] = arr[arr.dtype.names[i]]
     ...:     
In [140]: res
Out[140]: 
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]], dtype=float32)

If the number of fields is small compared to the number of records, this iteration is not expensive.


def foo(arr):
    res = np.empty((arr.shape[0],4), np.float32)
    for i in range(4):
        res[:,i] = arr[arr.dtype.names[i]]
    return res

With a large 4 field array, the by-field copy is clearly faster:

In [143]: arr = np.ones(10000, dtype=dt)
In [149]: timeit x1 = foo(arr)
10000 loops, best of 3: 73.5 µs per loop
In [150]: timeit x2 = np.array(arr.tolist(), np.float32)
100 loops, best of 3: 11.9 ms per loop
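As an aside, NumPy 1.16 and later (so after this answer was written) ship `numpy.lib.recfunctions.structured_to_unstructured`, which performs essentially this field-by-field copy in one call; a sketch assuming a recent NumPy:

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

dt = np.dtype('f4,u2,u2,f4')
arr = np.ones(10, dtype=dt)

# One column per field, cast to the requested dtype
res = structured_to_unstructured(arr, dtype=np.float32)
print(res.shape)  # (10, 4)
```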
hpaulj