I have approximately 5k raw data input files and 15k raw data test files, several GB in total. Since these are raw data files, I had to process them iteratively in MATLAB to obtain the features that I want to train my actual classifier (a CNN) on. As a result, I produced one HDF5 .mat file for each of the raw data files. I developed my model locally using Keras and a modified DirectoryIterator, in which I had something like
import os, h5py
import numpy as np

for i, j in enumerate(batch_index_array):
    arr = np.array(h5py.File(os.path.join(self.directory, self.filenames[j]), "r").get(self.variable))
    # process them further
The file structure is:
|
|--train
|  |--Class1
|  |  |-- 2.5k .mat files
|  |
|  |--Class2
|  |  |-- 2.5k .mat files
|
|--eval
|  |--Class1
|  |  |-- 2k .mat files
|  |
|  |--Class2
|  |  |-- 13k .mat files
This is the file structure that I currently have in my Google Cloud Storage bucket. It was working locally in Python with a small model, but now I'd like to use the hyperparameter tuning capabilities of Google Cloud ML Engine, since my model is a lot bigger. The problem is that I have read that HDF5 files cannot be read directly and easily from Google Cloud Storage. I tried to modify my script like this:
import tensorflow as tf
from tensorflow.python.lib.io import file_io

for i, j in enumerate(batch_index_array):
    with file_io.FileIO(os.path.join(self.directory, self.filenames[j]), mode='r') as input_f:
        arr = np.array(h5py.File(input_f.read(), "r").get(self.variable))
        # process them further
but it gives me an error similar to UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte, just with a different hex value and position 512, presumably because mode='r' tries to decode the binary HDF5 contents as UTF-8 text.
I also had something like this:
import tensorflow as tf
from tensorflow.python.lib.io import file_io

for i, j in enumerate(batch_index_array):
    with file_io.FileIO(os.path.join(self.directory, self.filenames[j]), mode='rb') as input_f:
        arr = np.fromstring(input_f.read())
        # process them further
but it doesn't work either; as far as I can tell, np.fromstring just reinterprets the raw bytes of the whole HDF5 container (headers included) instead of recovering the stored dataset.
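One direction I'm considering, though I haven't verified it on ML Engine, is to keep mode='rb' but hand h5py an in-memory file object instead of a filename; as far as I know this relies on h5py being able to open Python file-like objects, which requires h5py >= 2.9:

import io
import os
import h5py
import numpy as np
from tensorflow.python.lib.io import file_io

for i, j in enumerate(batch_index_array):
    # mode='rb' returns raw bytes, so nothing gets decoded as UTF-8
    with file_io.FileIO(os.path.join(self.directory, self.filenames[j]), mode='rb') as input_f:
        file_bytes = io.BytesIO(input_f.read())
    # h5py >= 2.9 accepts a file-like object in place of a filename
    arr = np.array(h5py.File(file_bytes, "r").get(self.variable))
    # process them further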
Question
How can I modify my script so that it can read those HDF5 files on Google Cloud ML Engine? I'm aware of the practice of pickling the data, but loading into memory a pickle created from 15k files (several GB in total) does not seem very efficient.
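The only other workaround I can think of is to stage each file on the worker's local disk just before it is needed, so that h5py gets an ordinary local path and nothing larger than one file is ever held at once. A rough sketch of what I mean (the helper name load_variable_from_gcs is mine, and I'm assuming file_io.copy behaves on ML Engine the way it does locally):

import os
import tempfile
import h5py
import numpy as np
from tensorflow.python.lib.io import file_io

def load_variable_from_gcs(gcs_path, variable):
    # Copy the .mat file from the bucket to local scratch space, since h5py
    # cannot open a gs:// path directly, then read the requested dataset.
    local_path = os.path.join(tempfile.gettempdir(), os.path.basename(gcs_path))
    file_io.copy(gcs_path, local_path, overwrite=True)
    try:
        with h5py.File(local_path, "r") as f:
            return np.array(f.get(variable))
    finally:
        os.remove(local_path)  # keep the worker's local disk from filling up

Inside my iterator this would replace the direct h5py.File(...) call, one file per batch element, at the cost of an extra copy per file, so I'm not sure it's any better than the approaches above.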