h5py error reading virtual dataset into NumPy array

Question

I'm trying to load data from a virtual HDF5 dataset created with h5py, and I'm having trouble loading the data correctly.

Here is an example of my issue:

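import h5py
import tools as ut  # my own helper module; it only supplies the file path

virtual = h5py.File(ut.params.paths.virtual)

# Copy the whole virtual dataset into a NumPy array
a = virtual['part2/index'][:]

print(virtual['part2/index'][-1])  # last row, read directly from the dataset
print(a[-1])                       # last row of the in-memory copy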

This outputs:

[890176134]
[0]

Why is the last element different when I copy the data into a NumPy array (value=[0]) than when I read it directly from the dataset (value=[890176134])?

Am I doing something horribly wrong without realizing it?

Thanks a lot.

kcw78 · Accepted Answer · 2021-11-15T17:35:54.370

Yes, you should get the same values from the Virtual Dataset or an array created from the Virtual Dataset. It's hard to diagnose the error without more details about the data.

I used the h5py example vds_simple.py to demonstrate how this should behave. Most of the code builds the HDF5 files; the section at the end compares the output. The code below is modified from the example to create a variable number of source files (set by a0).

Code to create the 'a0' source files with sample data:

import h5py
import numpy as np

a0 = 5000  # number of source files
# create sample data
data = np.arange(0, 100).reshape(1, 100)

# Create source files (0.h5 to 4999.h5 when a0=5000)
for n in range(a0):
    with h5py.File(f"{n}.h5", "w") as f:
        row_data = data + n
        f.create_dataset("data", data=row_data)

Code to define the virtual layout and assemble virtual dataset:

# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(a0, 100), dtype="i4")
for n in range(a0):
    filename = f"{n}.h5"
    vsource = h5py.VirtualSource(filename, "data", shape=(100,))
    layout[n] = vsource

# Add virtual dataset to output file
with h5py.File("VDS.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout)

Code to read and print the data:

# read data back
# virtual dataset is transparent for reader!
with h5py.File("VDS.h5", "r") as f:
    arr = f["vdata"][:]

    print("\nFirst 10 Elements in First Row:")
    print("Virtual dataset:")
    print(f["vdata"][0, :10])
    print("Reading vdata into Array:")
    print(arr[0, :10])

    print("Last 10 Elements of Last Row:")
    print("Virtual dataset:")
    print(f["vdata"][-1, -10:])
    print("Reading vdata into Array:")
    print(arr[-1, -10:])

Output from code above (w/ a0=5000):

First 10 Elements in First Row:
Virtual dataset:
[0 1 2 3 4 5 6 7 8 9]
Reading vdata into Array:
[0 1 2 3 4 5 6 7 8 9]
Last 10 Elements of Last Row:
Virtual dataset:
[5089 5090 5091 5092 5093 5094 5095 5096 5097 5098]
Reading vdata into Array:
[5089 5090 5091 5092 5093 5094 5095 5096 5097 5098]
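
The spot checks above compare only the first and last rows. To see exactly where a bulk read and direct reads disagree, every row can be compared. The following is a minimal sketch, not part of the original example: the function name `find_mismatched_rows` is hypothetical, and it assumes the dataset is small enough to read in full and that reading one row at a time is acceptable for a diagnostic.

```python
import numpy as np


def find_mismatched_rows(dset, bulk=None):
    """Return the indices of rows where a bulk read and a per-row read differ.

    `dset` is anything indexable like an h5py dataset (supports dset[:] and dset[i]).
    `bulk` is an optional array already obtained from dset[:].
    """
    if bulk is None:
        bulk = dset[:]  # one bulk read, as in the question
    bad = []
    for i in range(len(bulk)):
        row = dset[i]  # independent single-row read
        if not np.array_equal(bulk[i], row):
            bad.append(i)
    return bad
```

With the file built above, calling `find_mismatched_rows(f["vdata"])` inside the `with h5py.File("VDS.h5", "r") as f:` block should return an empty list if both access paths agree. A non-empty list shows which rows, and therefore which source files, are affected.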

Hi, thanks for the support. I guess I should bring this up as an issue with the h5py team. If you confirm the intended behavior is that I should get the same value, maybe there is a bug in the h5py implementation. I'm using an unusually large (5000+) number of virtual sources to construct the dataset, and the problem seems to appear only towards the end of the dataset. — pnjun, Nov 13 '21 at 20:18
If you think the problem is the number of sources, compare values for other slices: `virtual['part2/index'][0]` vs `arr[0]` then `virtual['part2/index'][100]` vs `arr[100]`, etc — kcw78, Nov 14 '21 at 19:23
Yep, that's what I did (even before asking the question). The values at the beginning of the array work fine. However there is not a clear point at which it stops working... for some indexes it works also towards the end of the array. — pnjun, Nov 15 '21 at 08:34
I parameterized the code above to create 5000 files and assemble them into a virtual dataset. It works. I will update the answer so you can do more testing. At this point, if you can't isolate the error, you need to ask the h5py team. Did you know **The HDF Group** has an h5py-specific forum? It's at [HDF Forum](https://forum.hdfgroup.org/). Look under the **HDF5 Ancillary Tools** topic. To post a question, you need to create a login. — kcw78, Nov 15 '21 at 17:26
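
One hypothesis the thread does not test is the per-process limit on open files. A bulk read touches every source file, while a single-row read touches only one, so on systems with a low limit the two reads could behave differently. Whether HDF5 then silently returns the fill value (0 by default) for sources it cannot open is an assumption here, not something confirmed by the posters. The sketch below only checks whether the number of sources fits under the current soft limit. It uses the Unix-only standard-library `resource` module; the function names and the `reserve` margin are hypothetical.

```python
import resource  # Unix-only standard-library module


def sources_fit(n_sources, soft_limit, reserve=16):
    """True if n_sources files plus `reserve` other descriptors fit under soft_limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return n_sources + reserve <= soft_limit


def check_open_file_limit(n_sources, reserve=16):
    """Compare n_sources against this process's open-file limit.

    Returns (fits, soft_limit, hard_limit).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return sources_fit(n_sources, soft, reserve), soft, hard
```

If `check_open_file_limit(5000)` reports that the sources do not fit, raising the soft limit (for example with `resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))`, where `new_soft` does not exceed `hard`) and rerunning the comparison would show whether the limit is involved.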
