3

I have approximately 5k raw data input files and 15k raw data test files, several GB in total. Since these are raw data files, I had to process them iteratively in Matlab to obtain the features I want to train my actual classifier (a CNN) on. As a result, I produced one HDF5 .mat file for each raw data file. I developed my model locally using Keras and a modified DirectoryIterator, in which I had something like

for i, j in enumerate(batch_index_array):
    arr = np.array(h5py.File(os.path.join(self.directory, self.filenames[j]), "r").get(self.variable))
    # process them further

The file structure is

|  
|--train  
|    |--Class1
|    |    |-- 2.5k .mat files  
|    |      
|    |--Class2
|         |-- 2.5k .mat files  
|--eval  
|    |--Class1
|    |    |-- 2k .mat files  
|    |      
|    |--Class2
|         |-- 13k .mat files

This is the file structure I currently have in my Google ML Storage bucket. It worked locally in Python with a small model, but now I'd like to use the Google ML hyperparameter tuning capabilities, since my model is a lot bigger. The problem is that I've read on the Internet that HDF5 files cannot be read directly and easily from Google ML Storage. I tried to modify my script like this:

import tensorflow as tf
from tensorflow.python.lib.io import file_io

for i, j in enumerate(batch_index_array):
    with file_io.FileIO(os.path.join(self.directory, self.filenames[j]), mode='r') as input_f:
        arr = np.array(h5py.File(input_f.read(), "r").get(self.variable))
        # process them further

but it gives me an error similar to UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte, just with a different hex byte and position 512.
I also tried something like this:

import tensorflow as tf
from tensorflow.python.lib.io import file_io

for i, j in enumerate(batch_index_array):
    with file_io.FileIO(os.path.join(self.directory, self.filenames[j]), mode='rb') as input_f:
        arr = np.fromstring(input_f.read())
        # process them further

but it also doesn't work.

Question
How can I modify my script so it can read those HDF5 files on Google ML? I'm aware of the practice of pickling data, but loading into memory a pickle created from 15k files (several GB) doesn't seem very efficient.

Colonder
  • try the read mode rb and not r. I guess r tries to interpret the data as a string. – max9111 Feb 26 '18 at 17:52
  • That is what I tried; it worked to a certain extent but then some errors occurred, I might check it out once again. The problem is that when I print the byte string from the `mat` file, at the beginning there's a line that describes the file properties, the Matlab version with which the file was created, etc. – Colonder Feb 26 '18 at 18:03
  • For mat files use https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html This should be the easiest way to read them. Why have you created so many files instead of creating a single dataset with all the data in it? It will be slow to read, but not because of hdf5. Accessing data over the network has a high latency. Also take a look at https://stackoverflow.com/a/44961222/4045774 which shows the influence of the chunk size on I/O speed. Over a network you will need quite a large chunk size for good performance. – max9111 Feb 26 '18 at 18:14
  • I used a Matlab script that was created by someone else and I didn't really have a look at what's going on inside. – Colonder Feb 26 '18 at 18:19
  • If you replace `arr = np.fromstring(input_f.read())` with `arr = scipy.io.loadmat(input_f.read())`, does it really not work? Also try `arr = scipy.io.loadmat(input_f)`. – max9111 Feb 26 '18 at 19:06
  • I don't know, I'll check – Colonder Feb 26 '18 at 19:08
  • With `r` mode it gives `'utf-8' codec can't decode byte 0x89 in position 512: invalid start byte` and with `rb` mode it gives `embedded null byte` – Colonder Feb 26 '18 at 19:14

2 Answers

4

HDF is a very common file format that, unfortunately, is not optimal in the cloud. For some explanations why, please see this blog post.

Given the inherent complexities of HDF on cloud, I recommend one of the following:

  1. Convert your data to another file format, such as CSV or TFRecords of tf.Example
  2. Copy the data locally to /tmp

Conversion can be inconvenient at best, and, for some datasets, perhaps gymnastics will be necessary. A cursory search on the internet revealed multiple tutorials on how to do so. Here's one you might refer to.
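
The tutorials linked above cover the details; purely as an illustrative sketch of option 1 (not the asker's actual pipeline), converting the arrays stored in the .mat files to TFRecords with the TF 1.x API could look roughly like this, where the variable name, label handling, and function name are assumptions:

import h5py
import numpy as np
import tensorflow as tf

def mat_files_to_tfrecord(mat_paths, variable, label, out_path):
    # One tf.train.Example per .mat file, all written into a single TFRecord.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        for path in mat_paths:
            with h5py.File(path, 'r') as f:
                arr = np.array(f.get(variable), dtype=np.float32)
            example = tf.train.Example(features=tf.train.Features(feature={
                'features': tf.train.Feature(
                    float_list=tf.train.FloatList(value=arr.ravel().tolist())),
                'shape': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=list(arr.shape))),
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())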

Likewise, there are multiple ways to copy data onto the local machine, but beware that your job won't start doing any actual training until the data is copied. Also, should one of the workers die, it will have to recopy all of the data when it starts up again. If the master dies and you are doing distributed training, this can cause a lot of work to be lost.

That said, if you feel this is a viable approach in your case (e.g., you're not doing distributed training and/or you're willing to wait for the data transfer as described above), just start your Python with something like:

import json
import os
import subprocess

# TF_CONFIG is a JSON string describing the cluster and this node's role.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))

# Parameter servers don't need the training data, so only copy on master/workers.
if tf_config.get('task', {}).get('type') != 'ps':
  subprocess.check_call(['mkdir', '-p', '/tmp/my_files'])
  subprocess.check_call(
      ['gsutil', '-m', 'cp', '-r', 'gs://my/bucket/my_subdir', '/tmp/my_files'])
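
Once the copy has finished, the original h5py-based loop should work against the local copy unchanged. A minimal sketch of what that might look like (the local path, class subdirectory, and variable name below are placeholders; the exact layout under /tmp/my_files depends on how the bucket is organized):

import glob
import h5py
import numpy as np

# The .mat (HDF5) files now live on local disk, so plain h5py access works.
# The glob pattern and 'my_variable' are hypothetical names.
for path in glob.glob('/tmp/my_files/my_subdir/train/Class1/*.mat'):
    with h5py.File(path, 'r') as f:
        arr = np.array(f.get('my_variable'))
        # process arr further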
rhaertel80
0

Reading data from a temporary in-memory file-like object

I do not have direct access to Google ML, so I have to apologize if this answer doesn't work. I did something similar to directly read h5 files from a zipped folder, and I hope this will work here too.

from scipy import io
import numpy as np
from io import BytesIO

# Create a test file
array = np.random.rand(10, 10, 10)
d = {"Array": array}
io.savemat("Test.mat", d)

# Read the data using an in-memory file-like object
with open("Test.mat", mode="rb") as input_f:
    output = BytesIO()
    num_b = output.write(input_f.read())
    ab = io.loadmat(output)
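
Since the asker's .mat files were saved as HDF5 (MATLAB's -v7.3 format), scipy.io.loadmat won't parse them, but the same in-memory idea can be combined with h5py, which accepts file-like objects from version 2.9 onwards. A rough sketch, where the bucket path and variable name are placeholders:

from io import BytesIO

import h5py
import numpy as np
from tensorflow.python.lib.io import file_io

# Read the raw bytes from GCS into memory; the path is a placeholder.
with file_io.FileIO('gs://my-bucket/train/Class1/sample.mat', mode='rb') as gcs_f:
    buf = BytesIO(gcs_f.read())

# h5py >= 2.9 can open file-like objects directly; MATLAB -v7.3 .mat files
# are HDF5 under the hood. 'my_variable' is a hypothetical dataset name.
with h5py.File(buf, 'r') as h5_f:
    arr = np.array(h5_f.get('my_variable'))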
max9111
  • You don't have to have access. Google ML uses Tensorflow, so technically the problem is "reading HDF5 files in Tensorflow". At least I guess so. Anyway, I think I'm going to create one huge CSV file instead of several thousand files; I think that'll be easier. – Colonder Feb 26 '18 at 20:10
  • Can't you feed a simple numpy array to tensorflow, as described here: https://stackoverflow.com/q/37620330/4045774 ? Or do you not have a normal way to access files? How do you access the csv files then? If so, I will give up here. Writing csv files (text files) out of binary data and then parsing them is like printing an e-Book and scanning it afterwards... – max9111 Feb 26 '18 at 20:20
  • I can but I can't read the HDF5 file in Google ML, but I can read CSV – Colonder Feb 26 '18 at 20:28
  • 1
    Then use the TFRecords file format. http://www.machinelearninguru.com/deep_learning/tensorflow/basics/tfrecord/tfrecord.html Using a text file for data that is several GB in size has to be avoided; you won't be happy with that. It will blow up the overall size of the data, and the read and write speed will be far from optimal. – max9111 Feb 26 '18 at 20:39
  • Ok, will check that – Colonder Feb 27 '18 at 09:38