How to load numpy npz files in google-cloud-ml jobs or from Google Cloud Storage?

Question

I have a google-cloud-ml job that needs to load numpy .npz files from a Google Cloud Storage (gs://) bucket. I followed this example on how to load .npy files from gs, but it didn't work for me since .npz files are compressed.

Here's my code:

from StringIO import StringIO
import tensorflow as tf
import numpy as np
from tensorflow.python.lib.io import file_io

f = StringIO(file_io.read_file_to_string('gs://my-bucket/data.npz'))
data = np.load(f)

And here's the error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 10: invalid start byte

Apparently, decoding the data to str is not correct, but I'm not sure how to address this.

Can someone help? Thanks!

score 5 · Accepted Answer · answered Jun 20 '17 at 18:25

It turns out I needed to set binary_mode=True in file_io.read_file_to_string(), so that the file contents come back as raw bytes instead of being decoded as UTF-8 text.

Here's the working code:

from io import BytesIO
import tensorflow as tf
import numpy as np
from tensorflow.python.lib.io import file_io

f = BytesIO(file_io.read_file_to_string('gs://my-bucket/data.npz', binary_mode=True))
data = np.load(f)

And this works for both compressed and uncompressed .npz files.
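To see why the binary flag matters without needing a bucket, here is a minimal sketch that uses only NumPy and the standard library. It assumes an archive built in memory stands in for the bytes that read_file_to_string(..., binary_mode=True) returns; the helper name npz_bytes is hypothetical.

```python
import io

import numpy as np


def npz_bytes(compressed):
    """Build an .npz archive in memory and return its raw bytes.

    Hypothetical stand-in for the bytes returned by
    file_io.read_file_to_string(path, binary_mode=True).
    """
    buf = io.BytesIO()
    save = np.savez_compressed if compressed else np.savez
    save(buf, x=np.arange(5))
    return buf.getvalue()


for compressed in (False, True):
    raw = npz_bytes(compressed)
    # np.load accepts a seekable binary file-like object such as BytesIO.
    with np.load(io.BytesIO(raw)) as data:
        assert data["x"].tolist() == [0, 1, 2, 3, 4]

# Treating the archive as text is what fails: the raw bytes are not valid UTF-8.
try:
    npz_bytes(compressed=False).decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

The loop loads both the uncompressed and the compressed archive through the same BytesIO path, while the final block reproduces the kind of UnicodeDecodeError seen in the question.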

score 1 · Answer 2 · rhaertel80 · edited Jun 22 '17 at 19:00

Try using io.BytesIO instead, which has the added bonus of being forwards-compatible with Python 3:

import io
import tensorflow as tf
import numpy as np
from tensorflow.python.lib.io import file_io

f = io.BytesIO(file_io.read_file_to_string('gs://my-bucket/data.npz',
                                           binary_mode=True))
data = np.load(f)
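As a small illustration of the Python 3 point, the sketch below shows that io.BytesIO accepts bytes while io.StringIO rejects them. The payload is a made-up byte string, not real file contents.

```python
import io

# Made-up bytes for illustration; not a real .npz file.
payload = b"PK\x03\x04 arbitrary binary payload \xa2"

# io.BytesIO wraps bytes and hands the same bytes back.
binary_stream = io.BytesIO(payload)
assert binary_stream.read() == payload

# io.StringIO only accepts text, so passing bytes raises TypeError on Python 3.
try:
    io.StringIO(payload)
    accepted = True
except TypeError:
    accepted = False
```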

edited Jun 22 '17 at 19:00

answered Jun 20 '17 at 16:34

rhaertel80


Tried it, but still didn't work -- got the same error message. Thanks though! – astromz Jun 20 '17 at 16:54
After setting `binary_mode=True` in `read_file_to_string`, your code works. Thanks. – astromz Jun 20 '17 at 18:27
I just edited the code, thanks. Strange though, it ran fine on the test I did, but this looks better anyways. – rhaertel80 Jun 22 '17 at 19:01

score 1 · Answer 3 · answered Dec 19 '17 at 19:14

An alternative is to read the file through file_io.FileIO (note that the file mode differs between TF 1.0 and later versions):

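from io import BytesIO

import numpy as np
from tensorflow.python.lib.io import file_io
from tensorflow import __version__ as tf_version

# Pick the file mode based on the installed TF version.
if tf_version >= '1.1.0':
    mode = 'rb'
else:  # for TF version 1.0
    mode = 'r'

# Read the whole file, then let numpy parse it from an in-memory buffer.
f_stream = file_io.FileIO('mydata.npz', mode)
d = np.load(BytesIO(f_stream.read()))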

Similarly, for pickle files:

import pickle
d = pickle.load(file_io.FileIO('mydata.pickle', mode))
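The same binary-stream idea can be tried locally. This is a minimal sketch in which an in-memory io.BytesIO stands in for the object returned by file_io.FileIO(..., 'rb'); the sample dictionary is made up.

```python
import io
import pickle

# Made-up data for illustration.
sample = {"weights": [0.1, 0.2], "epoch": 3}

# pickle writes bytes, so the stream has to be binary.
stream = io.BytesIO()
pickle.dump(sample, stream)

# Rewind, then read it back as you would from a file opened in binary mode.
stream.seek(0)
restored = pickle.load(stream)
```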
