
I'm trying to load data from a virtual HDF5 dataset created with h5py, but the values I read back are not consistent.

Here is an example of my issue:

import h5py
import tools as ut

virtual = h5py.File(ut.params.paths.virtual, "r")

a = virtual['part2/index'][:]

print(virtual['part2/index'][-1])
print(a[-1])

This outputs:

[890176134]
[0]

Why is the last element different when I copy the data into a NumPy array (value=[0]) than when I read it directly from the dataset (value=[890176134])?

Am I doing something horribly wrong without realizing it?

Thanks a lot.

kcw78
pnjun

1 Answer


Yes, you should get the same values from the Virtual Dataset or an array created from the Virtual Dataset. It's hard to diagnose the error without more details about the data.

I used the h5py example vds_simple.py to demonstrate how this should behave. Most of the code builds the HDF5 files; the section at the end compares the output. The code below is modified from the example to create a variable number of source files (set by a0).

Code to create the 'a0' source files with sample data:

import h5py
import numpy as np

a0 = 5000  # number of source files
# create sample data
data = np.arange(0, 100).reshape(1, 100)

# Create source files (0.h5 to 4999.h5 when a0=5000)
for n in range(a0):
    with h5py.File(f"{n}.h5", "w") as f:
        row_data = data + n
        f.create_dataset("data", data=row_data)

Code to define the virtual layout and assemble virtual dataset:

# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(a0, 100), dtype="i4")
for n in range(a0):
    filename = f"{n}.h5"
    vsource = h5py.VirtualSource(filename, "data", shape=(100,))
    layout[n] = vsource

# Add virtual dataset to output file
with h5py.File("VDS.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout)

Code to read and print the data:

# read data back
# virtual dataset is transparent for reader!
with h5py.File("VDS.h5", "r") as f:
    arr = f["vdata"][:]

    print("\nFirst 10 Elements in First Row:")
    print("Virtual dataset:")
    print(f["vdata"][0, :10])
    print("Reading vdata into Array:")
    print(arr[0, :10])

    print("Last 10 Elements of Last Row:")
    print("Virtual dataset:")
    print(f["vdata"][-1, -10:])
    print("Reading vdata into Array:")
    print(arr[-1, -10:])

Output from code above (w/ a0=5000):

First 10 Elements in First Row:
Virtual dataset:
[0 1 2 3 4 5 6 7 8 9]
Reading vdata into Array:
[0 1 2 3 4 5 6 7 8 9]
Last 10 Elements of Last Row:
Virtual dataset:
[5089 5090 5091 5092 5093 5094 5095 5096 5097 5098]
Reading vdata into Array:
[5089 5090 5091 5092 5093 5094 5095 5096 5097 5098]
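If the values still differ in your file, one thing worth ruling out is a missing source file: HDF5 returns the dataset's fill value (0 by default) for any region of a virtual dataset whose source file or dataset is missing. This is a guess, not a confirmed diagnosis. The sketch below lists the source files a virtual dataset refers to that are not on disk. It relies on h5py's `Dataset.virtual_sources()`; the helper name `missing_source_files` and the `base_dir` parameter are my own, and it assumes relative file names resolve against the directory you pass in, which simplifies how HDF5 actually searches for sources.

```python
import os


def missing_source_files(dset, base_dir="."):
    """List source files referenced by a virtual dataset that are not on disk.

    `dset` is expected to be an h5py virtual dataset (or any object with a
    compatible `virtual_sources()` method). `base_dir` is an assumption:
    the directory relative source names are resolved against.
    """
    missing = set()
    for mapping in dset.virtual_sources():
        name = mapping.file_name
        if name == ".":
            # "." means the source dataset lives in the same file as the VDS.
            continue
        path = name if os.path.isabs(name) else os.path.join(base_dir, name)
        if not os.path.exists(path):
            missing.add(name)
    return sorted(missing)


# Example use (paths are placeholders):
# import h5py
# with h5py.File("VDS.h5", "r") as f:
#     print(missing_source_files(f["vdata"], base_dir="."))
```

An empty list only shows that the files exist, not that every source can be opened and read, so treat it as a first check rather than proof.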
kcw78
  • Hi, thanks for the support. I guess I should bring this up as an issue with the h5py team. If you confirm the intended behavior is that I should get the same value, maybe there is a bug in the h5py implementation. I'm using an unusually large (5000+) number of virtual sources to construct the dataset, and the problem seems to appear only towards the end of the dataset. – pnjun Nov 13 '21 at 20:18
  • If you think the problem is the number of sources, compare values for other slices: `virtual['part2/index'][0]` vs `arr[0]` then `virtual['part2/index'][100]` vs `arr[100]`, etc – kcw78 Nov 14 '21 at 19:23
  • Yep, that's what I did (even before asking the question). The values at the beginning of the array work fine. However, there is no clear point at which it stops working: for some indexes it also works towards the end of the array. – pnjun Nov 15 '21 at 08:34
  • I parameterized the code above to create 5000 files and assemble them into a virtual dataset. It works. I will update the answer so you can do more testing. At this point, if you can't isolate the error, you need to ask the h5py team. Did you know **The HDF Group** has an h5py-specific forum? It's at [HDF Forum](https://forum.hdfgroup.org/). Look under the **HDF5 Ancillary Tools** topic. To post a question, you need to create a login. – kcw78 Nov 15 '21 at 17:26
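
The slice-by-slice comparison suggested in the comments can be automated to show exactly which indexes disagree. This is a minimal sketch, assuming the dataset is indexed along its first axis; `find_mismatched_rows` and `step` are hypothetical names, not part of h5py.

```python
import numpy as np


def find_mismatched_rows(dset, arr, step=1):
    """Return the indexes i where dset[i] differs from arr[i].

    `dset` is anything indexable by row (for example an open h5py dataset,
    read one row at a time); `arr` is the array obtained with dset[:].
    `step` lets you sample every n-th row on large datasets.
    """
    mismatched = []
    for i in range(0, len(arr), step):
        if not np.array_equal(dset[i], arr[i]):
            mismatched.append(i)
    return mismatched


# Example use with the file from the question (path is a placeholder):
# import h5py
# with h5py.File("virtual.h5", "r") as f:
#     dset = f["part2/index"]
#     arr = dset[:]
#     print(find_mismatched_rows(dset, arr))
```

Mapping the reported indexes back to the source files that cover them may show whether the bad values cluster in particular sources.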