
I have a folder of JPEG images that I'm trying to convert to a folder of TFRecords. The best I can do with the code below is write all the JPEGs into one TFRecord file, but I'm not sure how to use that one large file, and my other starter code requires an individual TFRecord file for each image. For example, I was given a folder of 5 TFRecords to begin with.

# Source: https://stackoverflow.com/questions/33849617/how-do-i-convert-a-directory-of-jpeg-images-to-tfrecords-file-in-tensorflow
# Note: modified from source
import os

import tensorflow as tf


def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Takes arrays of images and labels and writes them all into a single .tfrecords file
def convert_to(images, labels, output_directory, name):
    num_examples = labels.shape[0]
    if images.shape[0] != num_examples:
        raise ValueError("Images size %d does not match label size %d." %
                         (images.shape[0], num_examples))
    rows = images.shape[1]
    cols = images.shape[2]
    depth = images.shape[3]  # was hard-coded to 1; the images here are (N, 256, 256, 3)

    filename = os.path.join(output_directory, name + '.tfrecords')
    print('Writing', filename)
    writer = tf.io.TFRecordWriter(filename)  # tf.python_io.TFRecordWriter in TF 1.x
    for index in range(num_examples):
        image_raw = images[index].tobytes()
        example = tf.train.Example(features=tf.train.Features(feature={
            'height': _int64_feature(rows),
            'width': _int64_feature(cols),
            'depth': _int64_feature(depth),
            'label': _int64_feature(int(labels[index])),
            'image_raw': _bytes_feature(image_raw)}))
        writer.write(example.SerializeToString())
    writer.close()  # close the writer so all records are flushed to disk

Above is my convert_to function; can it be changed to answer my question? Below is the rest of the code. You can see from the shape print near the end that it is correctly given the arrays and labels from the 300 images.

import random

import numpy as np
import skimage.io
from tqdm import tqdm


def read_image(file_name, images_path):
    image = skimage.io.imread(os.path.join(images_path, file_name))
    return image

def extract_image_index_make_label(img_name):
    # Original scheme (disabled): parse the label from parts of the filename.
    # remove_ext = img_name.split(".")[0]
    # name, serie, repetition, char = remove_ext.split("_")
    # label = int(char) + 1000 * int(repetition) + 1_000_000 * int(serie)
    label = random.randint(1, 300)  # placeholder: random label per image
    return label

images_path = "/content/monet_jpg/"
image_list = os.listdir(images_path)
images = []
labels = []
for img_name in tqdm(image_list):
    images.append(read_image(img_name, images_path))
    labels.append(extract_image_index_make_label(img_name))
images_array = np.array(images)
labels = np.array(labels)
print(images_array.shape, labels.shape)
# (300, 256, 256, 3) (300,)

convert_to(images_array, labels, ".", "ALL_MONET_TFREC")

Even using a folder of TFRecords would still have efficiency benefits over a folder of JPEGs, correct? In any case, that is what my starter code is set up to use.
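To make the goal concrete, this is roughly the per-image variant I am imagining, reusing the helpers above (a sketch, not tested; the function name and file-naming scheme are just guesses):

def convert_to_single_tfrecords(images, labels, output_directory):
    # One .tfrecords file per image, named by index (naming scheme is a guess)
    rows, cols, depth = images.shape[1], images.shape[2], images.shape[3]
    for index in range(images.shape[0]):
        filename = os.path.join(output_directory, 'image_%05d.tfrecords' % index)
        with tf.io.TFRecordWriter(filename) as writer:
            example = tf.train.Example(features=tf.train.Features(feature={
                'height': _int64_feature(rows),
                'width': _int64_feature(cols),
                'depth': _int64_feature(depth),
                'label': _int64_feature(int(labels[index])),
                'image_raw': _bytes_feature(images[index].tobytes())}))
            writer.write(example.SerializeToString())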


1 Answer

I can give you some examples from a real case I have been working on:

  1. Set of images (15000 .jpeg images, 15 classes, i.e. 1000 per class, 224x224x3). The size on disk is 398 MB, while the set of records in TFRecord format is 336 MB. That figure includes the overhead of the extra metadata attached to each record (for example, a str label attached to every protobuf instance), so you get a reduction of 398 - 336 = 62 MB despite the additional metadata; a further reduction could be made if the proto contained only the serialized image.
  2. Training speed. Using tf.data.Dataset() together with TFRecordDataset increases training speed. For example, on the same dataset, the length of an epoch decreased by ~30 seconds in my case with tf.data.Dataset() + TFRecordDataset versus tf.data.Dataset.from_tensor_slices() (nothing else in the pipeline changed: same network, HParams, CPU, GPU, etc.); see the sketch just below this list.
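A minimal sketch of such a pipeline, reading back the records written by the question's convert_to (the feature keys match the question's code; the batch size and the use of tf.data.AUTOTUNE are assumptions, not part of the original):

import tensorflow as tf

# Feature spec mirrors the keys written by convert_to in the question
feature_spec = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_raw(parsed['image_raw'], tf.uint8)
    image = tf.reshape(image, (256, 256, 3))  # shape known from the question
    return image, parsed['label']

dataset = (tf.data.TFRecordDataset(['ALL_MONET_TFREC.tfrecords'])
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))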

Indeed, these differences are task-specific, but I have noticed overall improvements in:

  1. Size allotted on disk (one big chunk takes a bit less space than the same data split into many small chunks).
  2. Training speed (which, in my view, is even more important).

Note also some other advantages not included in my previous example:

Fast I/O: the TFRecord format can be read with parallel I/O operations, which is useful for TPUs or multiple hosts.
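For instance, with one record file per image (as the question asks for), the reads can be parallelized across files. A sketch; the directory name and cycle_length are assumptions:

import tensorflow as tf

# Read many small TFRecord files in parallel (glob pattern is an assumption)
files = tf.data.Dataset.list_files('/content/monet_tfrec/*.tfrecords')
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=8,                      # number of files read concurrently
    num_parallel_calls=tf.data.AUTOTUNE)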
