Tensorflow: Load unknown TFRecord dataset

Question

I got a TFRecord data file filename = train-00000-of-00001 which contains images of unknown size and maybe other information as well. I know that I can use dataset = tf.data.TFRecordDataset(filename) to open the dataset.

How can I extract the images from this file to save it as a numpy-array?

I also don't know if there is any other information saved in the TFRecord file such as labels or resolution. How can I get these information? How can I save them as a numpy-array?

I normally only use numpy-arrays and am not familiar with TFRecord data files.

MajesticKhan · Accepted Answer · 2019-02-18T02:40:22.293

1.) How can I extract the images from this file to save it as a numpy-array?

What you are looking for is this:

record_iterator = tf.python_io.tf_record_iterator(path=filename)

for string_record in record_iterator:
  example = tf.train.Example()
  example.ParseFromString(string_record)

  print(example)

  # Exit after 1 iteration as this is purely demonstrative.
  break

2.) How can I get these information?

Here is the official documentation. I strongly suggest that you read the documentation because it goes step by step in how to extract the values that you are looking for.

Essentially, you have to convert example to a dictionary. So if I wanted to find out what kind of information is in a tfrecord file, I would do something like this (in context with the code stated in the first question): dict(example.features.feature).keys()

3.) How can I save them as a numpy-array?

I would build upon the for loop mentioned above. So for every loop, it extracts the values that you are interested in and appends them to numpy arrays. If you want, you could create a pandas dataframe from those arrays and save it as a csv file.

But...

You seem to have multiple tfrecord files...tf.data.TFRecordDataset(filename) returns a dataset that is used to train models.

So in the event for multiple tfrecords, you would need a double for loop. The outer loop will go through each file. For that particular file, the inner loop will go through all of the tf.examples.

EDIT:

Converting to np.array()

import tensorflow as tf
from PIL import Image
import io

for string_record in record_iterator:
  example = tf.train.Example()
  example.ParseFromString(string_record)

  print(example)

  # Get the values in a dictionary
  example_bytes = dict(example.features.feature)['image_raw'].bytes_list.value[0]
  image_array = np.array(Image.open(io.BytesIO(example_bytes)))
  print(image_array)
  break

Sources for the code above:

Base code
Converting bytes to PIL.JpegImagePlugin.JpegImageFile
Converting from PIL.JpegImagePlugin.JpegImageFile to np.array

Official Documentation for PIL

EDIT 2:

import tensorflow as tf
from PIL import Image
import io
import numpy as np

# Load image
cat_in_snow  = tf.keras.utils.get_file(path, 'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')

#------------------------------------------------------Convert to tfrecords
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def image_example(image_string):
  feature = {
      'image_raw': _bytes_feature(image_string),
  }
  return tf.train.Example(features=tf.train.Features(feature=feature))

with tf.python_io.TFRecordWriter('images.tfrecords') as writer:
    image_string = open(cat_in_snow, 'rb').read()
    tf_example = image_example(image_string)
    writer.write(tf_example.SerializeToString())
#------------------------------------------------------


#------------------------------------------------------Begin Operation
record_iterator = tf.python_io.tf_record_iterator(path to tfrecord file)

for string_record in record_iterator:
  example = tf.train.Example()
  example.ParseFromString(string_record)

  print(example)

  # OPTION 1: convert bytes to arrays using PIL and IO
  example_bytes = dict(example.features.feature)['image_raw'].bytes_list.value[0]
  PIL_array = np.array(Image.open(io.BytesIO(example_bytes)))

  # OPTION 2: convert bytes to arrays using Tensorflow
  with tf.Session() as sess:
      TF_array = sess.run(tf.image.decode_jpeg(example_bytes, channels=3))

  break
#------------------------------------------------------


#------------------------------------------------------Compare results
(PIL_array.flatten() != TF_array.flatten()).sum()
PIL_array == TF_array

PIL_img = Image.fromarray(PIL_array, 'RGB')
PIL_img.save('PIL_IMAGE.jpg')

TF_img = Image.fromarray(TF_array, 'RGB')
TF_img.save('TF_IMAGE.jpg')
#------------------------------------------------------

Remember that tfrecords is just simply a way of storing information for tensorflow models to read in an efficient manner.
I use PIL and IO to essentially convert the bytes to an image. IO takes the bytes and converts them to a file like object that PIL.Image can then read
Yes, there is a pure tensorflow way to do it: tf.image.decode_jpeg
Yes, there is a difference between the two approaches when you compare the two arrays
Which one should you pick? Tensorflow is not the way to go if you are worried about accuracy as stated in Tensorflow's github : "The TensorFlow-chosen default for jpeg decoding is IFAST, sacrificing image quality for speed". Credit for this information belongs to this post

With your help I can see now that every example consists of several `features` which all have a `key` and a `value`. These keys are `label`, `image/encoded`, `image/width` and `image/height`. It is still not clear to me how to read this information and how to convert the encoded image into a matrix with the dimensions `WxHxC`. — Gilfoyle, Feb 16 '19 at 12:01
Great! Now I can extract the image. However, I don't understand why it works. Why do I have to load `PIL` and `io`? What exactly do these libraries? Is there a pure tensorflow way to do it? I would like to be able to extract the images doing something like `image, label = sess.run(extract_image, feed_dict={path:filename})` — Gilfoyle, Feb 17 '19 at 11:54
I am getting: DecodeError: Error parsing message at 1, using this record: !wget -O "/tmp/train.tfrecords" "http://ciir.cs.umass.edu/downloads/Antique/tf-ranking/ELWC/train.tfrecords" — PascalIv, May 19 '20 at 07:39

Tensorflow: Load unknown TFRecord dataset

1 Answers1