There are at least 2 ways to access multiple files:
- If all files follow a naming pattern, you can use the
glob
module. It uses wildcards to find files. (Note: I prefer
glob.iglob
; it is an iterator that yields values without creating a list. glob.glob
creates a list which you frequently don't need.)
- Alternatively, you could input a list of filenames and loop on
the list.
Example of iglob:
import glob
for fname in glob.iglob('img_data_0?.h5'):
with h5py.File(fname, 'r') as h5f:
print('Here are the keys of the input file\n', h5.keys())
Example with a list of names:
filenames = [ 'img_data_01.h5', 'img_data_02.h5', 'img_data_03.h5' ]
for fname in filenames:
with h5py.File(fname, 'r') as h5f:
print('Here are the keys of the input file\n', h5.keys())
Next, your code mentions using [:]
when you access a dataset. Whether or not you need to add indices depends on the object you want returned.
- If you include
[()]
, it returns the entire dataset as a numpy array. Note [()]
is now preferred over [:]
. You can use any valid slice notation, e.g., [0,0,:]
for a slice of a 3-axis array.
- If you don't include
[:]
, it returns a h5py dataset object, which
behaves like a numpy array. (For example, you can get dtype and shape, and slice the data). The advantage? It has a smaller memory footprint. I use h5py dataset objects unless I specifically need an array (for example, passing image data to another package).
Examples of each method:
jets_dset = h5f['jets'] # w/out [()] returns a h5py dataset object
jets_arr = h5f['jets'][()] # with [()] returns a numpy array object
Finally, if you want to create a single array that merges values from 3 datasets, you have to create an array big enough to hold the data, then load with slice notation. Alternatively, you can use np.concatenate()
(However, be careful, as concatenating a lot of data can be slow.)
A simple example is shown below. It assumes you know the shape of the dataset, and they are the same for all 3 files. (a0, a1
are the axes lengths for 1 dataset) If you don't know them, you can get them from the .shape
attribute
Example for method 1 (pre-allocating array jets3x_arr
):
a0, a1 = 100, 100
jets3x_arr = np.empty(shape=(a0, a1, 3)) # add dtype= if not float
for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
with h5py.File(fname, 'r') as h5f:
jets3x_arr[:,:,cnt] = h5f['jets']
Example for method 2 (using np.concatenate()
):
a0, a1 = 100, 100
for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
with h5py.File(fname, 'r') as h5f:
if cnt == 0:
jets3x_arr= h5f['jets'][()].reshape(a0,a1,1)
else:
jets3x_arr= np.concatenate(\
(jets3x_arr, h5f['jets'][()].reshape(a0,a1,1)), axis=2)