
I have an existing h5py file that I downloaded which is ~18 GB in size. It contains a number of nested datasets:

h5f = h5py.File('input.h5', 'r') 
data = h5f['data']
latlong_data = data['lat_long'].value

I want to be able to do some basic min/max scaling of the numerical data within lat_long, so I want to put it in its own h5py file for easier use and lower memory usage.

However, when I try to write it out to its own file:

out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong_data)
out.close() 

The output file is incredibly large. It's still not done writing to disk and is already ~85 GB. Why is the data being written to the new file not compressed?


1 Answer


It could be that h5f['data/lat_long'] uses compression filters (and you aren't using any). To check the original dataset's compression settings, use this line:

print(h5f['data/lat_long'].compression, h5f['data/lat_long'].compression_opts)

After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the entire dataset into an array, which is not necessary in most use cases. An h5py dataset object behaves much like a NumPy array. Instead, use ds = h5f['data/lat_long'] to create a dataset object (instead of an array) and slice it "like" it's a NumPy array. FYI, .value is a deprecated method to return the dataset as an array; use this syntax instead: arr = h5f['data/lat_long'][()]. Loading the dataset into an array also requires more memory than using an h5py dataset object (which can be an issue with large datasets).
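
For example, here is a minimal sketch of computing the values needed for min/max scaling one block of rows at a time, so the full 18 GB dataset never has to fit in memory (the block size of 1,000,000 rows is an arbitrary choice, not from the original post):

import h5py
import numpy as np

with h5py.File('input.h5', 'r') as h5f:
    ds = h5f['data/lat_long']  # dataset object; no data loaded yet
    block_rows = 1_000_000     # arbitrary tuning knob, adjust to taste
    ds_min, ds_max = np.inf, -np.inf
    for start in range(0, ds.shape[0], block_rows):
        block = ds[start:start + block_rows]  # only this slice is read
        ds_min = min(ds_min, block.min())
        ds_max = max(ds_max, block.max())
    print('min/max for scaling:', ds_min, ds_max)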

There are other ways to access the data. My suggestion to use dataset objects is one way. Your method (extracting the data to a new file) is another way. I am not fond of that approach because you then have 2 copies of the data: a bookkeeping nightmare. Another alternative is to create external links from the new file to the existing 18 GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in this post: [How can I combine multiple .h5 file?][1] (Method 1: Create External Links).
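
Here is a minimal sketch of the external link approach, assuming the filenames above (latlong_link.h5 is a hypothetical name for the small linking file):

import h5py

# Create a small file whose 'latlong' entry points into the big file.
with h5py.File('latlong_link.h5', 'w') as f:
    f['latlong'] = h5py.ExternalLink('input.h5', '/data/lat_long')

# Reading through the link behaves like reading the dataset directly:
with h5py.File('latlong_link.h5', 'r') as f:
    print(f['latlong'].shape)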

If you still want to copy the data, here is what I would do. Your code reads the dataset into an array, then writes the array to the new file (uncompressed). Instead, copy the dataset with h5py's group .copy() method; it will retain the compression settings and attributes. See below:

import h5py

with h5py.File('input.h5', 'r') as h5f1, \
     h5py.File('latlong_only.h5', 'w') as h5f2:

    h5f1.copy(h5f1['data/lat_long'], h5f2, 'latlong')
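
To confirm the copy retained the compression settings, re-run the same check on the new file:

import h5py

with h5py.File('latlong_only.h5', 'r') as h5f2:
    print(h5f2['latlong'].compression, h5f2['latlong'].compression_opts)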