
Consider the problem of creating a dataset by sampling random small image patches from a directory of high-resolution images. The TensorFlow Dataset API makes this easy to express: construct a dataset of image filenames, shuffle it, map it to loaded images, then map those to randomly cropped patches.
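
For concreteness, here is a minimal sketch of that naive pipeline, written against the TF 1.x-era API used in the answer below; image_files, the shuffle buffer size, and the 32x32 patch size are placeholder choices:

import tensorflow as tf

def load_image(filename):
    """Read and decode one JPEG file into a float image in [0, 1]."""
    image = tf.image.decode_jpeg(tf.read_file(filename), channels=3)
    return tf.image.convert_image_dtype(image, tf.float32)

# image_files: a list of paths to the high-resolution images
dataset = (tf.data.Dataset.from_tensor_slices(image_files)
    .shuffle(buffer_size=1000)
    .map(load_image)  # loads and decodes one full image per element
    .map(lambda img: tf.image.random_crop(img, [32, 32, 3])))  # one patch per loaded image

Every element of the final dataset triggers a full image read and decode, which is exactly the inefficiency described next.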

However, this naive implementation is very inefficient: a separate high-resolution image is loaded, decoded, and cropped to generate each patch. Ideally, an image would be loaded once and reused to generate many patches.

One simple way that was discussed previously is to generate multiple patches from each image and flatten them into the dataset. However, this biases the data too much: consecutive patches come from the same image, whereas we want each training batch to draw from different images.

Ideally what I would like is a "random caching filter" transformation that takes an underlying dataset and caches N of its elements in memory. Its iterator returns a random element from the cache, and at a pre-defined frequency it replaces a random cached element with a new one from the underlying dataset. This filter would allow faster data access at the expense of less randomization and higher memory consumption.
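
To make the intended semantics concrete, here is a minimal sketch in plain Python; random_caching_filter, cache_size, and replace_prob are illustrative names, not an existing API:

import random

def random_caching_filter(source, cache_size=50, replace_prob=0.01):
    """Serve random elements from a fixed-size in-memory cache over `source`.

    On each draw, with probability `replace_prob`, a random cached element
    is evicted and replaced by the next element of the underlying iterator:
    faster I/O at the cost of repeated elements and extra memory.
    """
    it = iter(source)
    cache = [next(it) for _ in range(cache_size)]  # fill the cache up front
    while True:
        if random.random() < replace_prob:
            cache[random.randrange(cache_size)] = next(it)
        yield random.choice(cache)

Such a generator could be wrapped with tf.data.Dataset.from_generator as a prototype, though the caching logic would then run in Python rather than in the C++ runtime.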

Is there such functionality available?

If not, should it be implemented as a new dataset transformation or simply a new iterator? It seems a new iterator is all that is needed. Any pointers on how to create a new dataset iterator, ideally in C++?

Dimofeevich
  • augmentation is the process of generating more training data by applying a number of random transformations to the samples that yield believable-looking images, NOT "sampling random small image patches" – Mitch Wheat Feb 14 '18 at 00:40
  • @Dimofeevich: isn't [`tf.data.Dataset.shuffle`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle) exactly doing what you want? It has a buffer of elements and randomly samples one when called. – Olivier Moindrot Feb 15 '18 at 02:15
  • @OlivierMoindrot I had to study the code because the documentation for shuffle is lacking. Sadly it is not what I want. The only difference is the replacement policy in the cache. shuffle appears to evict an element from the cache once it is used. Instead I would like to evict much less frequently from the cache, which would allow me to improve the I/O efficiency at the cost of occasionally repeating the same element. Since I would follow this filter with random crop, the same image will result in different crops, so it is ok – Dimofeevich Feb 15 '18 at 17:47
  • @Dimofeevich it's been a while since you posted this. Did you ever come up with a solution that uses tfdata? The suggested solutions don't totally address your bias concerns in my opinion. Nor do they allow for a "review" of your patches for analysis. I've been using a generator class (caching big images and dynamically sampling according to a pre-set sample list) that addresses all this but it doesn't use tfdata etc.. When I say "review" my patches, some areas of images (or in my case mapping layers) are over/under-represented and so being able to adjust how many are included is useful. – user1269942 Oct 18 '18 at 06:54
  • What I ended up doing is shuffle the images, followed by extracting N patches from each image, followed by a shuffle of the patches. Seems to work well – Dimofeevich Sep 09 '19 at 17:48

1 Answer


You should be able to use tf.data.Dataset.shuffle to achieve what you want. Here is a quick summary of the objectives:

  • load very big images, produce smaller random crops from the images and batch them together
  • make the pipeline efficient by creating multiple patches from a big image once the image is loaded
  • add enough shuffling so that a batch of patches is diverse (all the patches come from different images)
  • don't load too many big images in cache

You can achieve all that using the tf.data API by doing the following steps:

  1. shuffle the filenames of the big images
  2. read the big images
  3. generate multiple patches from each image
  4. shuffle all these patches again with a big enough buffer size (see this answer on buffer size); adjusting the buffer size is a tradeoff between shuffle quality and the number of patches held in memory
  5. batch them
  6. prefetch one batch

Here is the relevant code:

filenames = ...  # filenames containing the big images
num_samples = len(filenames)

# Parameters
num_patches = 100               # number of patches to extract from each image
patch_size = 32                 # size of the patches
buffer_size = 50 * num_patches  # shuffle patches from 50 different big images
num_parallel_calls = 4          # number of threads
batch_size = 10                 # size of the batch

get_patches_fn = lambda image: get_patches(image, num_patches=num_patches, patch_size=patch_size)

# Create a Dataset serving batches of random patches in our images
dataset = (tf.data.Dataset.from_tensor_slices(filenames)
    .shuffle(buffer_size=num_samples)  # step 1: fitting all the filenames in the buffer ensures good shuffling
    .map(parse_fn, num_parallel_calls=num_parallel_calls)  # step 2
    .map(get_patches_fn, num_parallel_calls=num_parallel_calls)  # step 3
    .apply(tf.contrib.data.unbatch())  # unbatch the patches we just produced
    .shuffle(buffer_size=buffer_size)  # step 4
    .batch(batch_size)  # step 5
    .prefetch(1)  # step 6: make sure you always have one batch ready to serve
)

iterator = dataset.make_one_shot_iterator()
patches = iterator.get_next()  # shape [None, patch_size, patch_size, 3]

sess = tf.Session()
res = sess.run(patches)

The functions parse_fn and get_patches are defined like this:

def parse_fn(filename):
    """Decode the jpeg image from the filename and convert to [0, 1]."""
    image_string = tf.read_file(filename)

    # Don't use tf.image.decode_image, or the output shape will be undefined
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)

    # This will convert to float values in [0, 1]
    image = tf.image.convert_image_dtype(image_decoded, tf.float32)

    return image


def get_patches(image, num_patches=100, patch_size=16):
    """Get `num_patches` random crops from the image."""
    patches = []
    for _ in range(num_patches):
        patch = tf.image.random_crop(image, [patch_size, patch_size, 3])
        patches.append(patch)

    patches = tf.stack(patches)
    assert patches.get_shape().as_list() == [num_patches, patch_size, patch_size, 3]

    return patches
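
As the comments note, the code above targets the TF 1.x API (tf.contrib.data.unbatch, make_one_shot_iterator, tf.Session). Here is a sketch of the same pipeline adapted to TF 2.x, assuming parse_fn is updated to call tf.io.read_file, with tf.data.AUTOTUNE and the unbatch method as the current names:

# TF 2.x sketch, reusing filenames, the parameters, parse_fn and
# get_patches_fn defined above.
dataset = (tf.data.Dataset.from_tensor_slices(filenames)
    .shuffle(buffer_size=num_samples)
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .map(get_patches_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .unbatch()  # Dataset.unbatch() replaces tf.contrib.data.unbatch()
    .shuffle(buffer_size=buffer_size)
    .batch(batch_size)
    .prefetch(1))

for patches in dataset:  # eager iteration replaces the Session and iterator
    ...  # each batch has shape [batch_size, patch_size, patch_size, 3]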
Olivier Moindrot
  • I <3 random small nuggets of knowledge embedded into great answers: _Don't use tf.image.decode_image, or the output shape will be undefined_. – Ciprian Tomoiagă Jun 14 '18 at 08:27
  • What if you have two images, one for training and the other as the target and you need the indices of the random patches to be the same across the images. So if the first patch extracted from image 1 is `[2:10, :]`, the first patch extracted from image 2 is also `[2:10, :]`. Is there an efficient way to do this? – Luke Nov 02 '18 at 19:09
  • 2
    You can call `tf.random_crop` on the concatenation of the input and output image: `tf.random_crop([image, output], size=[2, patch_size, patch_size, 3])`. The `size` argument needs to begin with `2` since you always want to keep the two images. – Olivier Moindrot Nov 02 '18 at 19:34
  • 1
    Anyone else seem to be getting a memory leak from this? My memory usage expands on every epoch. – Luke Dec 26 '19 at 20:57
  • 1
    @OlivierMoindrot the method in `get_patches` doesn't appear to be performant in tf2, even with a `tf.function` annotation. Any ideas on how to speed it up? – Luke Mar 20 '20 at 17:09
  • @luke Did you ever solve the memory or performance problems here? I’m noticing a lot of memory leak on tf.image.extract_patches, and I was hoping it would use the GPU, but it doesn’t look like it. – bw4sz Jun 17 '20 at 23:27
  • I didn't unfortunately – Luke Jun 18 '20 at 01:19
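
For the paired-crop technique from Olivier Moindrot's comment above, here is a minimal sketch; paired_random_crop is a hypothetical helper name, and it assumes the input and target images have identical shapes (TF 1.x tf.random_crop, as in the comment):

import tensorflow as tf

def paired_random_crop(image, target, patch_size=32):
    """Crop the same random window from an image and its target."""
    # Stacking gives shape [2, H, W, 3]; because the crop keeps the first
    # dimension whole, both images receive the same random spatial offset.
    pair = tf.stack([image, target])
    crop = tf.random_crop(pair, size=[2, patch_size, patch_size, 3])
    return crop[0], crop[1]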