I have a CSV with approximately 40 million rows, where each row is a training instance. Following the documentation on consuming TFRecords, I am trying to encode the data and save it in a TFRecord file.
All the examples I have found (even the ones in the TensorFlow repo) show that creating a TFRecord depends on the class TFRecordWriter. This class has a write method that takes a serialised string representation of the data and writes it to disk. However, this appears to be done one training instance at a time.
How do I write a batch of the serialised data?
Let's say I have a function:
def write_row(sentiment, text, encoded):
    feature = {"one_hot": _float_feature(encoded),
               "label": _int64_feature([sentiment]),
               "text": _bytes_feature([text.encode()])}
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())
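(Here _float_feature, _int64_feature, and _bytes_feature are the usual wrappers from the TensorFlow documentation; I include them for completeness:)

import tensorflow as tf

def _bytes_feature(value):
    # Wraps a list of bytes objects in a Feature proto.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))

def _float_feature(value):
    # Wraps a list of floats in a Feature proto.
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def _int64_feature(value):
    # Wraps a list of ints in a Feature proto.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))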
Writing to disk 40 million times (once for each example) is going to be incredibly slow. It would be far more efficient to batch this data and write 50k or 100k examples at a time (as far as the machine's resources allow). However, there does not appear to be any method for this inside TFRecordWriter.
Something along the lines of:
class MyRecordWriter:
    def __init__(self, writer):
        self.records = []
        self.counter = 0
        self.writer = writer

    def write_row_batched(self, sentiment, text, encoded):
        feature = {"one_hot": _float_feature(encoded),
                   "label": _int64_feature([sentiment]),
                   "text": _bytes_feature([text.encode()])}
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        self.records.append(example.SerializeToString())
        self.counter += 1
        if self.counter >= 10000:
            self.writer.write(os.linesep.join(self.records))
            self.counter = 0
            self.records = []
But when reading the file created by this method, I get the following error:
tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Could not parse example input, value: '
��
label
��
one_hot����
��
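For context, the reading side is roughly the sketch below; the filename, vocabulary size, and batch size are placeholders rather than my exact setup:

import tensorflow as tf

VOCAB_SIZE = 10000  # placeholder for the actual one-hot dimension

feature_spec = {
    "one_hot": tf.FixedLenFeature([VOCAB_SIZE], tf.float32),
    "label": tf.FixedLenFeature([1], tf.int64),
    "text": tf.FixedLenFeature([1], tf.string),
}

def _parse(serialized):
    # Expects one serialised tf.train.Example per record, which is
    # presumably why the linesep-joined blob above fails to parse.
    return tf.parse_single_example(serialized, feature_spec)

dataset = tf.data.TFRecordDataset("train.tfrecord").map(_parse).batch(32)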
Note: I could change the encoding process so that each Example proto contains several thousand training instances instead of just one, but I don't want to pre-batch the data when writing to the TFRecord file in this way, as it would introduce extra overhead in my training pipeline when I want to use the file for training with different batch sizes. A sketch of that alternative follows below.
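For clarity, the pre-batched alternative I want to avoid would look something like this sketch (feature names as above): every feature becomes one flat list spanning the whole block, which then has to be unpacked and re-batched at training time:

def write_row_block(sentiments, texts, encodeds):
    # One Example proto holding many training instances at once.
    # The one-hot rows must be flattened into a single float list,
    # so the fixed row width is needed later to recover the instances.
    feature = {"one_hot": _float_feature([v for row in encodeds for v in row]),
               "label": _int64_feature(sentiments),
               "text": _bytes_feature([t.encode() for t in texts])}
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())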