
I have a PyTables (HDF5) file with a large number of nested groups. I already have a way of iterating through all the arrays in the file. They are float64; I want to convert the file in place so that every data point is stored as float32 instead.

According to this question, one way to overwrite arrays is to assign values. The following code snippet takes the "count" value/array in the file, converts it to float32, and assigns it back:

import h5py
import numpy as np

# filehead is a string for a file
with h5py.File(filehead, 'r+') as f:
    # Lots of stuff here ... e.g. `head` is a string
    path = head + '/obsnorm/Standardizer/count'

    # `[()]` reads the whole dataset (`.value` was removed in h5py 3.0).
    print("/obsnorm/Standardizer/count {}".format(f[path]))
    print("count value: {}".format(f[path][()]))
    # Cast to float32 and assign back into the existing dataset.
    f[path][...] = f[path][()].astype('float32')
    print("/obsnorm/Standardizer/count {}".format(f[path]))
    print("count value: {}".format(f[path][()]))

Unfortunately, this prints:

/obsnorm/Standardizer/count <HDF5 dataset "count": shape (), type "<f8">
count value: 512364.0
/obsnorm/Standardizer/count <HDF5 dataset "count": shape (), type "<f8">
count value: 512364.0

In other words, before the assignment the type of count is f8 (float64), and after casting and assigning it back the type is still float64.

How do I modify this in-place so that the data is truly understood as float32?

ComputerScientist
  • Your linked SO makes it clear that you can only overwrite the data if the shape is the same; otherwise you need to make a new dataset. The same applies to `dtype`. To change `dtype` you have to make a new dataset, either in this file or a new one. – hpaulj Jul 04 '17 at 17:46
  • I see, that makes sense. I suppose I was hoping that h5py had an internal method which could do this conversion. But I can figure out how to make the new dataset. – ComputerScientist Jul 04 '17 at 17:47
  • You could explore the base `HDF5` code and documentation (C++ and so on). – hpaulj Jul 04 '17 at 17:48
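
The point made in the comments can be checked with a small sketch. It assumes a throwaway in-memory file (the `core` driver with `backing_store=False`); the file name and the dataset name `count` are only illustrative. Assigning into an existing dataset leaves its stored type alone, while deleting the dataset and creating it again under the same name gives it the new type.

```python
import h5py
import numpy as np

# In-memory HDF5 file; nothing is written to disk. Names are hypothetical.
with h5py.File("dtype_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("count", data=np.float64(512364.0))
    dtype_before = dset.dtype

    # Assignment writes values into the existing storage, so the values are
    # converted to the dataset's stored type, not the other way around.
    dset[...] = np.float32(512364.0)
    dtype_after = dset.dtype

    # To change the type, replace the dataset: read, cast, delete, recreate.
    value = dset[()].astype("float32")
    del f["count"]
    f.create_dataset("count", data=value)
    dtype_replaced = f["count"].dtype

print(dtype_before, dtype_after, dtype_replaced)
```

Note that deleting a dataset unlinks it but does not by itself shrink the file on disk, which is one reason to write the converted data to a new file instead.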

1 Answer


As hpaulj suggested in the comments, I simply created a duplicate HDF5 file in which the datasets have type f4 (the same as float32). That achieved what I needed.

The pseudocode is as follows:

import h5py
import numpy as np

# Open the original file together with a new file whose name ends in `_float32.h5`
# (the slice assumes `newfile` ends in '.h5').
with h5py.File(oldfile, 'r') as f, h5py.File(newfile[:-3]+'_float32.h5', 'w') as newf:
    # `head` is some directory structure
    # Create groups to follow the same directory structure.
    # `require_group` does not fail if the group already exists.
    newf.require_group(head)

    # When it comes time to create a dataset, make the cast here.
    # `[()]` reads the whole dataset (`.value` was removed in h5py 3.0).
    newdata = f[head+'/name_here'][()].astype('float32')
    newf.create_dataset(head+'/name_here', data=newdata, dtype='f4')

    # Proceed for all other datasets.
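
The "proceed for all other datasets" step can be automated by walking the file. The following is a minimal sketch rather than the exact code I ran: the function name `copy_as_float32` is hypothetical, and it assumes ordinary numeric datasets that fit in memory. It does not carry over chunking or compression settings, and links are not followed.

```python
import h5py
import numpy as np


def copy_as_float32(src_path, dst_path):
    """Copy all groups and datasets, storing float64 datasets as float32."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        # Attributes attached to the root group.
        for key, val in src.attrs.items():
            dst.attrs[key] = val

        def visit(name, obj):
            # `name` is the path relative to the root of the source file.
            if isinstance(obj, h5py.Group):
                new = dst.require_group(name)
            else:
                data = obj[()]  # reads the whole dataset into memory
                dtype = obj.dtype
                if dtype == np.float64:
                    data = data.astype(np.float32)
                    dtype = np.float32
                new = dst.create_dataset(name, data=data, dtype=dtype)
            # Carry over attributes of the group or dataset.
            for key, val in obj.attrs.items():
                new.attrs[key] = val

        # Calls `visit` for every group and dataset below the root.
        src.visititems(visit)
```

Calling `copy_as_float32(oldfile, oldfile[:-3] + '_float32.h5')` would then write the converted copy next to the original, assuming `oldfile` ends in `.h5`.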
ComputerScientist