
I have an h5py file, which contains information about a dataset. There are n items inside the dataset and k keys. For example, for each item I have stored a value for the keys bbox, number_keypoints, etc. As the dataset is too large for me, I want to randomly sample from it and create a smaller h5py or JSON file.

Let's say I want to sample items [1, 6, 16]. Then I want to take these indices for all keys (I hope it is clear what I am trying to do).

Here is what my idea looks like:

import h5py
with h5py.File(my_file, "r") as f:
    arr = [1, 6, 16]
    f = {key: value for i, (key, value) in enumerate(f.items()) if i in arr}

Unfortunately, this doesn't work. Can anyone help me here?

spadel

3 Answers


You can use what the h5py guide calls fancy indexing:

Say you have a dataset ds with the numbers 1 to 10, of which you want to take the indices specified by arr = [2, 4, 5]. Then sub_ds = ds[arr] will get you an array of length 3 containing the values at the desired indices in arr.
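A minimal, self-contained sketch of that indexing (using an in-memory file so nothing is written to disk; the dataset name `numbers` is made up for the demo):

```python
import h5py
import numpy as np

# Throwaway in-memory file (driver="core" with backing_store=False keeps it in RAM).
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("numbers", data=np.arange(1, 11))  # values 1..10
    arr = [2, 4, 5]    # note: h5py requires fancy-index lists in increasing order
    sub_ds = ds[arr]   # reads only the selected elements from the dataset
    print(sub_ds)      # [3 5 6]
```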

If you have an array of keys called keys (you can use f.keys() only if there are no groups under your root, only datasets; otherwise you'll get an error), you can modify your code to get what you want:

import h5py
with h5py.File(my_file, "r") as f:
    arr = [1, 6, 16]
    f_subset = {key: f[key][arr] for key in keys}
Assaf
  • Thanks a lot, this is better! :) I also didn't know that this type of indexing is supported, that's nice – spadel Jan 27 '21 at 20:02
  • @spadel, note that the 2 methods return different object types for the dictionary value. With Assaf's method they are (numpy) array slices, and your method returns value as a list. Depending on what you want to do with these values, this may or may not matter in downstream operations. Also, as Assaf mentioned, this only works if all objects at the file's root level are datasets (not groups). It also assumes all datasets have the same shape (and sufficient size for the indices in your list). – kcw78 Jan 27 '21 at 23:30
  • @kcw78 you are absolutely right, both data structures are not interchangeable and need different functions to manipulate and serialize. You can, however, convert NumPy arrays to lists and vice versa quite easily. BTW if you want to find all the dataset keys in an HDF5 file, check out this thread https://stackoverflow.com/questions/44883175/how-to-list-all-datasets-in-h5py-file/65924963#65924963 I've also added my own answer there that does this with a simple function. – Assaf Jan 28 '21 at 07:48
  • @Assaf, yes I'm familiar with the h5py `.visit()` and `.visititems()` methods. I have a couple of posts using them. I suspect @spadel's use case is model specific, so doesn't have to worry about issues I mention. For those that work with a more general data schema, I added an answer to this thread to show the logic to test for datasets and their shape attribute. – kcw78 Jan 28 '21 at 20:11
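To make the distinction from the comments concrete, here is a small sketch of converting between the two value types (the key name `bbox` is just illustrative):

```python
import json
import numpy as np

subset = np.array([3, 5, 6])   # fancy indexing (ds[arr]) returns a NumPy array
as_list = subset.tolist()      # plain Python list, e.g. for writing a JSON file
print(json.dumps({"bbox": as_list}))  # {"bbox": [3, 5, 6]}
# json.dumps({"bbox": subset}) would raise TypeError: ndarray is not JSON serializable
```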

Sorry guys, I just figured it out myself:

import h5py
with h5py.File(my_file, "r") as f:
    arr = [1, 6, 16]
    f_subset = {key: [value for i, value in enumerate(list(f[key])) if i in arr] for key in f.keys()}

This is doing what I want :)

spadel
  • okay this approach is WAY too slow - there are over 2 million items per key. Does anyone know how to speed it up? – spadel Jan 27 '21 at 17:06

As mentioned in comments above, previous answers only work if all objects at the file's root level are datasets (not groups) and the datasets have appropriate shape and size for the indices in the slice list. The code below shows logic to validate and process the nodes (keys) with datasets of appropriate shape.

import h5py
with h5py.File(my_file, "r") as h5f:
    arr = [1, 6, 16]
    f_subset = dict()
    for key in h5f.keys():
        # keep only datasets (not groups) that are 1-D and long enough to index
        if isinstance(h5f[key], h5py.Dataset):
            if len(h5f[key].shape) == 1 and h5f[key].shape[0] > max(arr):
                f_subset[key] = h5f[key][arr]

All 3 code examples posted here only operate on datasets at the root level. If a recursive search is required, this process can be extended with the h5py .visit() and .visititems() methods to recursively find nodes (groups and datasets). There are other SO Answers that cover this.
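For completeness, a sketch of that recursive variant using `.visititems()` (the file layout and dataset names below are invented for the demo, and the file is kept in memory):

```python
import h5py
import numpy as np

arr = [1, 6, 16]
f_subset = {}

def collect(name, node):
    # Called for every group and dataset; keep only 1-D datasets long enough to index.
    if isinstance(node, h5py.Dataset) and node.ndim == 1 and node.shape[0] > max(arr):
        f_subset[name] = node[arr]

# Demo file with a nested group, kept in RAM for illustration.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("bbox", data=np.arange(20))
    f.create_dataset("meta/number_keypoints", data=np.arange(30))
    f.visititems(collect)   # walks the whole tree, not just the root level
```

Note that `name` is the full path relative to the root, so nested datasets end up under keys like "meta/number_keypoints".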

kcw78