I am building simulation software, and I need to write thousands of 2D NumPy arrays into tables in an HDF5 file, where one dimension of each array is variable. The incoming arrays
are of float32 type; to save disk space, every array is stored as a table with appropriate data types for the columns (hence tables rather than arrays). When I read the tables back, I'd like to retrieve a numpy.ndarray of type float32, so I can do nice calculations for analysis. Below is example code with an array holding species A, B, and C plus time.
The way I am currently reading and writing 'works', but it is very slow. The question is thus: what is the appropriate way of storing the array
into a table
quickly, and of reading it back into an ndarray? I have been experimenting with numpy.recarray, but I cannot get it to work (type errors, dimension errors, wholly wrong numbers, etc.).
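For reference, what I was hoping a recarray would let me do is turn the 2D float32 array into a structured array with the table's column dtypes in one step, which I could then append in a single call. A sketch using only NumPy (the column names and dtypes mirror the table description in the code below):

```python
import numpy as np

var_dim = 100
# Same example data as below: 4 rows, variable number of columns
array = (np.random.random((4, var_dim)) * 100).astype(np.float32)

# Structured dtype mirroring the table columns: A/time as float32, B/C as uint16
dtype = np.dtype([('A', np.float32), ('B', np.uint16),
                  ('C', np.uint16), ('time', np.float32)])

# Build the structured array field by field (one field per row of `array`)
structured = np.empty(var_dim, dtype=dtype)
for name, row in zip(dtype.names, array):
    structured[name] = row  # the float32 -> uint16 assignment casts B and C

print(structured.shape, structured.dtype.names)
```

This is the kind of object I would hope to pass to table.append() in one call instead of looping row by row, but I haven't managed to get the recarray route working.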
Code:
import tables as pt
import numpy as np
# Variable dimension
var_dim = 100
# Example array, rows 0 and 3 should be stored as float32, rows 1 and 2 as uint16
array = (np.random.random((4, var_dim)) * 100).astype(np.float32)
filename = 'test.hdf5'
hdf = pt.open_file(filename=filename, mode='w')
group = hdf.create_group(hdf.root, "group")
particle = {
    'A': pt.Float32Col(),
    'B': pt.UInt16Col(),
    'C': pt.UInt16Col(),
    'time': pt.Float32Col(),
}
dtypes = np.array([
    np.float32,
    np.uint16,
    np.uint16,
    np.float32,
])
# This is the table to be stored in
table = hdf.create_table(group, 'trajectory', description=particle, expectedrows=var_dim)
# My current way of storing: one append call per column of the array
for row in array.T:
    table.append([tuple(t(x) for t, x in zip(dtypes, row))])
table.flush()
hdf.close()
hdf = pt.open_file(filename=filename, mode='r')
array_table = next(hdf.root.group._f_iter_nodes())
# My current way of reading: convert each table row to an ndarray, one at a time
row_list = []
for row in array_table.read():
    row_list.append(np.array(list(row)))
# The retrieved array
array = np.asarray(row_list).T
# I've tried something with a recarray
rec_array = array_table.read().view(type=np.recarray)
# This gives me errors, or wrong results
rec_array.view(dtype=np.float64)
hdf.close()
The error I get:
Traceback (most recent call last):
File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 475, in __setattr__
ret = object.__setattr__(self, attr, val)
ValueError: new type not compatible with array.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/thomas/Documents/Thesis/SO.py", line 53, in <module>
rec_array.view(dtype=np.float64)
File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 480, in __setattr__
raise exctype(value)
ValueError: new type not compatible with array.
Closing remaining open files:test.hdf5...done
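For completeness, on the reading side what I ultimately want is to go from the structured array that table.read() returns back to a (4, var_dim) float32 ndarray in one vectorized step. A sketch with plain NumPy, where the structured array stands in for the result of table.read():

```python
import numpy as np

var_dim = 100
# Stand-in for table.read(): a structured array with the table's dtypes
dtype = np.dtype([('A', np.float32), ('B', np.uint16),
                  ('C', np.uint16), ('time', np.float32)])
structured = np.zeros(var_dim, dtype=dtype)
structured['A'] = np.arange(var_dim, dtype=np.float32)
structured['time'] = 0.1 * np.arange(var_dim, dtype=np.float32)

# Stack the fields into a homogeneous float32 array, one field per row
recovered = np.vstack([structured[name].astype(np.float32)
                       for name in dtype.names])
print(recovered.shape)  # (4, var_dim)
```

Is field-by-field stacking like this the intended way, or is there a faster/cleaner conversion I am missing?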