
I have a CSV with approximately 40 million rows, each of which is a training instance. Following the documentation on consuming TFRecords, I am trying to encode the data and save it in a TFRecord file.

All the examples I have found (even the ones in the TensorFlow repo) show that creating a TFRecord depends on the class `TFRecordWriter`. This class has a `write` method that takes a serialised string representation of the data and writes it to disk. However, this appears to be done one training instance at a time.

How do I write a batch of the serialised data?

Let's say I have a function:

  # Assumes `writer` is an open tf.python_io.TFRecordWriter and that
  # _float_feature, _int64_feature and _bytes_feature are the usual
  # tf.train.Feature helpers.
  def write_row(sentiment, text, encoded):
    feature = {"one_hot": _float_feature(encoded),
               "label": _int64_feature([sentiment]),
               "text": _bytes_feature([text.encode()])}

    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())

Writing to disk 40 million times (once per example) is going to be incredibly slow. It would be far more efficient to batch the data and write 50k or 100k examples at a time (as far as the machine's resources allow). However, there does not appear to be any method for this in `TFRecordWriter`.

Something along the lines of:

# Accumulate serialised examples and flush them to the wrapped writer
# in chunks of 10,000.
class MyRecordWriter:

  def __init__(self, writer):
    self.records = []
    self.counter = 0
    self.writer = writer

  def write_row_batched(self, sentiment, text, encoded):
    feature = {"one_hot": _float_feature(encoded),
               "label": _int64_feature([sentiment]),
               "text": _bytes_feature([text.encode()])}

    example = tf.train.Example(features=tf.train.Features(feature=feature))
    self.records.append(example.SerializeToString())
    self.counter += 1
    if self.counter >= 10000:
      self.writer.write(os.linesep.join(self.records))
      self.counter = 0
      self.records = []

But when reading the file created by this method I get the following error:

tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Could not parse example input, value: '…'

(The printed value is binary garbage, with fragments of the feature names "label" and "one_hot" still visible in it.)

Note: I could change the encoding process so that each example proto contains several thousand rows instead of just one, but I don't want to pre-batch the data in the TFRecord file this way: it would add overhead in my training pipeline when I later want to use the file for training with different batch sizes.
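For reference, the read-time batching I want to keep is along these lines; a minimal sketch, where the `parse_example` function and file name are illustrative rather than code I already have:

  import tensorflow as tf

  def parse_example(serialised):
    # The feature spec must mirror the one used at write time; only the
    # requested keys are parsed.
    features = {"label": tf.FixedLenFeature([], tf.int64),
                "text": tf.FixedLenFeature([], tf.string)}
    return tf.parse_single_example(serialised, features)

  # Batching happens at read time, so the same file serves any batch size.
  dataset = (tf.data.TFRecordDataset("data.tfrecord")
             .map(parse_example)
             .batch(128))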


1 Answer


TFRecord is a binary format. With the line `self.writer.write(os.linesep.join(self.records))` you are treating it like a text file.

`os.linesep` is the operating-system-dependent line separator (either `\n` or `\r\n`); splicing it between records corrupts the binary record framing, which is why the file cannot be parsed back.

Solution: just write the records one at a time. There is no need for a batched write, because the output is already buffered (see the sketch below and the next section). For 40 million rows you might also want to consider splitting the data up into separate files to allow better parallelisation.
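A minimal sketch of that, assuming a `rows` iterable of `(sentiment, text, encoded)` tuples (illustrative, not from the question) plus the `_float_feature`-style helpers the question defines:

  import tensorflow as tf

  # `rows` yields (sentiment, text, encoded) tuples; the _*_feature
  # helpers are the ones from the question.
  with tf.python_io.TFRecordWriter("data.tfrecord") as writer:
    for sentiment, text, encoded in rows:
      feature = {"one_hot": _float_feature(encoded),
                 "label": _int64_feature([sentiment]),
                 "text": _bytes_feature([text.encode()])}
      example = tf.train.Example(features=tf.train.Features(feature=feature))
      # One serialised proto per call; the writer adds the length-prefixed
      # record framing, and the underlying file buffers the small writes.
      writer.write(example.SerializeToString())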

When using `TFRecordWriter`: the file is already buffered.

Evidence for that is found in the source:

  • tf_record.py calls pywrap_tensorflow.PyRecordWriter_New
  • PyRecordWriter calls Env::Default()->NewWritableFile
  • Env->NewWritableFile calls NewWritableFile on the matching FileSystem
  • e.g. PosixFileSystem calls fopen
  • fopen returns a stream which "is fully buffered by default if it is known to not refer to an interactive device"
  • That will be file system dependent but WritableFile notes "The implementation must provide buffering since callers may append small fragments at a time to the file."
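If you do split the data into separate files, here is a sketch of round-robin sharding; the shard count, file-name pattern, and `serialise` helper are all illustrative:

  import tensorflow as tf

  NUM_SHARDS = 16  # example value; tune to your data and disks

  # `serialise` is a hypothetical helper that builds and serialises the
  # tf.train.Example exactly as in the question's write_row.
  writers = [tf.python_io.TFRecordWriter(
                 "train-%05d-of-%05d.tfrecord" % (i, NUM_SHARDS))
             for i in range(NUM_SHARDS)]
  for n, row in enumerate(rows):
    # Round-robin so the shards end up roughly the same size and can be
    # read in parallel later.
    writers[n % NUM_SHARDS].write(serialise(*row))
  for w in writers:
    w.close()

The shards can then be consumed in parallel at training time, e.g. by interleaving one `tf.data.TFRecordDataset` per file.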
  • Thanks, that clears things up a lot. When you say use a buffered writer, I believe the standard Python `with open("path", "wb")` approach provides a buffered writer with no extra cost. However I can't find any way to check if the class `TFRecordWriter` is also buffering the stream before writing to disk... – Insectatorious Feb 22 '18 at 15:12
  • The question was how to write bulk data into TFRecords, so how does one do so? – bluesummers Aug 29 '18 at 07:05
  • @bluesummers there is no need for a special batch or bulk operation. The answer is in the "Solution" section. – de1 Aug 29 '18 at 07:21
  • I'm sorry but I didn't get the solution: I experience the same issue. How do I buffer writes into a TFRecord? The writing is done through `tf.python_io.TFRecordWriter`, which has no parameters or options for buffering. This is not the standard `open('...', 'wb')` operation – bluesummers Aug 29 '18 at 07:32
  • @bluesummers the question did not mention `TFRecordWriter`. I added a section showing that `TFRecordWriter` is buffered. – de1 Aug 29 '18 at 12:24
  • @de1 could you give a simple example showing how to use `TFRecordWriter` for batch write? – Maosi Chen May 10 '19 at 17:32
  • I'm not sure that is true. Even if it is, not being able to control the size of the bulk/buffer is bad practice. – Mr.O Feb 02 '23 at 14:27