I am trying to train a model on gcloud. I uploaded the data, in a folder called Pokemon, to a GCS bucket. The data does not need any labels because I am doing unsupervised learning. The code runs fine locally, but when I try to train it on gcloud it fails to fetch the data.

This is my task code:

import tensorflow as tf 
import argparse
import numpy as np 
import trainer.model as model
from tensorflow.contrib.training.python.training import hparam


def run_experiment(hparams):
    train_input = model.input_fn(hparams.train_dir)

    # Transpose RGB channels into 3 different independent image
    # Then flatted all pixel into one dimension
    X_flat = np.transpose(train_input, (0,3,1,2))
    X_flat = X_flat.reshape(2376, 1600)

    print ('Original image shape:  {0}\nFlatted image shape:  {1}'.format(train_input.shape, X_flat.shape))

    print ('Constructing model')

    # tf Graph input (only pictures)
    X = tf.placeholder("float", [None, model.n_input])
    # Construct model
    encoder_op = model.encoder(X)
    variation_op, KLD, epsilon, layer_mu = model.variation(encoder_op)
    decoder_op = model.decoder(variation_op)
    # Prediction
    y_pred = decoder_op
    # Targets (labels) are the input data
    y_true = X

    # Define loss and optimizer
    l2_loss = tf.add_n([tf.nn.l2_loss(model.weights[w]) for w in model.weights])
    BCE = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_pred, labels=y_true), reduction_indices=1)

    cost = tf.reduce_mean(BCE+KLD)+model.l2_lambda*l2_loss
    optimizer = tf.train.RMSPropOptimizer(model.learning_rate).minimize(cost)

    # Init variables
    init = tf.global_variables_initializer()
    # Create session and graph, init variables
    sess = tf.InteractiveSession()
    sess.run(init)
    total_batch = int(X_flat.shape[0]/model.batch_size)
    # Training cycle
    for epoch in range(model.training_epochs):
        # Loop over all batches
        start = 0; end = model.batch_size
        for i in range(total_batch-1):
            index = np.arange(start, end)
            np.random.shuffle(index)
            batch_xs = X_flat[index]
            start = end; end = start+model.batch_size
            #Run optimization op (backprop) and loss op (to get loss value)
            _, c = sess.run([optimizer, cost], feed_dict={X: batch_xs})
        # Display logs per epoch step
        if ((epoch == 0) or (epoch+1) % model.display_step == 0) or ((epoch+1) == model.training_epochs):
            print ('Epoch: {0:04d}   loss: {1:f}'.format(epoch+1, c))
    print("Optimization finished")

    # Save trained Variables
    weightSaver = tf.train.Saver(var_list=model.weights)
    biaseSaver = tf.train.Saver(var_list=model.biases)
    save_path = weightSaver.save(sess, hparams.job_dir+"/VAE_weights.ckpt")
    save_path = biaseSaver.save(sess, hparams.job_dir+"/VAE_biases.ckpt")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Input Arguments
    parser.add_argument(
        '--train-dir',
        help='GCS or local paths to training data',
        nargs='+',
        required=True
    )

    parser.add_argument(
        '--job-dir',
        help='GCS location to write checkpoints and export models',
        required=True
    )
    args = parser.parse_args()

    hparams=hparam.HParams(**args.__dict__)

    run_experiment(hparams)

And here is the input_fn:

def input_fn(dir):
    images = np.empty((0, 40, 40, 3), dtype='float32')
    for pic in glob.glob(dir[0]+'/*.png'):
        img = mpimg.imread(pic)
        # remove alpha channel  %some alpha=0 but RGB is not equal to [1., 1., 1.]
        img[img[:,:,3]==0] = np.ones((1,4))
        img = img[:,:,:3]
        images = np.append(images, [img], axis=0)

    return images

My problem is that when I start the training using:

gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $OUTPUT_PATH \
    --runtime-version 1.4 \
    --module-name trainer.train_task \
    --package-path trainer/ \
    --region $REGION \
    -- \
    --train-dir $TRAIN_DATA 

with TRAIN_DATA=gs://$BUCKET_NAME/Pokemon

I get this error: ValueError: cannot reshape array of size 0 into shape (2376,1600), which means it is not fetching any images. The very same code works if I run it locally using the absolute path of the Pokemon folder stored locally.

Does anyone know what I am doing wrong?

All the best.

GhzNcl
    I think glob might not be able to access GCS directly, so your input_fn might not work. You can use TensorFlow fileio lib to read files on GCS. – Guoqing Xu Apr 09 '18 at 08:42

1 Answer

This question is similar to this one, although it doesn't directly cover matplotlib's imread function.

In short, regular Python file operations such as glob.glob, and any functions that internally use them (in this case, Matplotlib's imread calls Python's open), cannot work with GCS paths. More info can be found in this answer.
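To see why this produces the exact error in the question, here is a minimal sketch that can be run locally. The bucket name is hypothetical and nothing is contacted over the network; glob simply treats the gs:// string as a local path that does not exist.

```python
import glob

import numpy as np

# Hypothetical bucket name, used only for illustration.
gcs_pattern = 'gs://example-bucket/Pokemon/*.png'

# glob only searches the local filesystem, so it finds no matches here.
matches = glob.glob(gcs_pattern)
print(matches)

# With no matches, the loading loop never runs and the array stays empty.
images = np.empty((0, 40, 40, 3), dtype='float32')
for pic in matches:
    pass  # each image would be read and appended here

# Reshaping the empty array reproduces the ValueError from the question.
try:
    np.transpose(images, (0, 3, 1, 2)).reshape(2376, 1600)
    reshape_error = None
except ValueError as err:
    reshape_error = str(err)
print(reshape_error)
```

So the reshape is only where the failure surfaces; the real problem is that no file was ever matched.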

Adapting that info to your case, and taking advantage of the fact that imread accepts file-like objects, you'll need something like:

import numpy as np
import matplotlib.image as mpimg
import tensorflow as tf
from tensorflow.python.lib.io import file_io

def input_fn(dir):
  images = np.empty((0, 40, 40, 3), dtype='float32')
  # file_io understands gs:// paths as well as local ones
  for pic in file_io.get_matching_files(dir[0]+'/*.png'):
    # open in binary mode and hand the file-like object to imread
    with file_io.FileIO(pic, 'rb') as f:
      img = mpimg.imread(f)
    # remove alpha channel  %some alpha=0 but RGB is not equal to [1., 1., 1.]
    img[img[:,:,3]==0] = np.ones((1,4))
    img = img[:,:,:3]
    images = np.append(images, [img], axis=0)

  return images
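
Separately from the GCS fix, it may help to make the flattening step fail with a clearer message when nothing is loaded, and to infer the shape instead of hard-coding (2376, 1600). The following is only a sketch: flatten_channels is a hypothetical helper name, not something from your trainer package, and it assumes the images arrive as an (N, H, W, C) array as in your input_fn.

```python
import numpy as np


def flatten_channels(images):
    """Turn (N, H, W, C) images into (N * C, H * W) rows, one row per channel."""
    if images.shape[0] == 0:
        # Fail early with a hint instead of an opaque reshape error.
        raise ValueError('No images were loaded; check the --train-dir path')
    n, h, w, c = images.shape
    # Move the channel axis next to the batch axis, then merge the two.
    return np.transpose(images, (0, 3, 1, 2)).reshape(n * c, h * w)


# 792 RGB images of 40x40 pixels give the (2376, 1600) shape from the question.
X_flat = flatten_channels(np.zeros((792, 40, 40, 3), dtype='float32'))
print(X_flat.shape)
```

With a guard like this, an empty result from the file listing would be reported as a data loading problem rather than as a reshape failure.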
rhaertel80