
I am trying to serialize variable-length training data with TensorFlow, but I am unable to reconstruct it because I cannot think of a way to pass along the length of each training instance.

Serialize data:

import tensorflow as tf
import numpy as np

data = [["foo", "bar", "baz"], ["the", "quick", "brown", "fox"]]

def _bytes_feature(val):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[val]))

def _int64_feature(val):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[val]))

def serialize_data(input_data, path):
    """ iterate over and serialize data. """
    writer = tf.python_io.TFRecordWriter(path)
    datums = len(input_data)
    for i in range(datums):
        data_len = len(input_data[i])
        raw_data = np.array(input_data[i]).tostring()
        this_example = tf.train.Example(
            features = tf.train.Features(feature={
                "raw_data": _bytes_feature(raw_data),
                "data_len": _int64_feature(data_len)
            }))

        writer.write(this_example.SerializeToString())
    writer.close()

if __name__ == "__main__":
    serialize_data(data, "./out.tfrecord")

My approach is to record the length of each data point and pack it into each example; then, when reading the data back in for training, use that length to reshape the raw data. The problem is that when I reconstruct data_len it is a tf.Tensor, so it cannot be passed to set_shape() to reshape the raw data.

Error:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'Tensor'

Import data (code that produces the error):

dataset = tf.contrib.data.TFRecordDataset(["out.tfrecord"])

def extract_raw_data(my_example):
    features = {
        "raw_data": tf.FixedLenFeature([], tf.string),
        "data_len": tf.FixedLenFeature([], tf.string),
    }
    parsed_features = tf.parse_single_example(my_example, features)
    data = tf.decode_raw(parsed_features['raw_data'], tf.string)
    len_data = tf.decode_raw(parsed_features['data_len'], tf.int32)
    # data.set_shape() <-- use len_data here to reshape data
    return data

dataset = dataset.map(extract_raw_data)
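For context on the commented-out line above: Tensor.set_shape() requires plain Python ints, but tf.reshape does accept a tensor-valued shape, so a dynamically decoded length can drive the reshape. A minimal sketch, where flat and length are hypothetical stand-ins for the decoded record fields:

```python
import tensorflow as tf

# Stand-ins for the decoded record fields: a flat tensor and its
# dynamically known length (hypothetical placeholders).
flat = tf.constant([1, 2, 3, 4, 5, 6], dtype=tf.int32)
length = tf.constant(6, dtype=tf.int32)

# set_shape() needs static Python ints, but tf.reshape accepts a
# tensor-valued shape, so the length tensor can be used directly.
reshaped = tf.reshape(flat, tf.stack([length]))
```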

One solution I've considered is to find the max length over all training data instances, pad each instance to it, and then simply hard-code the reshape value (much as one would when processing images), but I am wondering if there is a way to pass each training instance's length and reconstruct it.
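For reference on the padding route: in later TensorFlow versions, tf.data.Dataset.padded_batch pads each batch to the length of its longest element, so no global max length or hard-coded shape is needed. A minimal sketch with a toy generator (the values here are made up for illustration):

```python
import tensorflow as tf

# Toy variable-length dataset (values made up for illustration).
def gen():
    yield [1, 2, 3]
    yield [4, 5, 6, 7]

ds = tf.data.Dataset.from_generator(gen, output_types=tf.int32)

# Pad each batch of 2 to the length of its longest element.
ds = ds.padded_batch(2, padded_shapes=[None])
```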

Thanks.

  • Actually, this has basically already been asked. https://stackoverflow.com/questions/43019852/tensorflow-getting-scalar-tensor-value-as-int-for-pass-to-set-shape – o-90 Oct 02 '17 at 20:06
