I want to divide my images into smaller windows which will be sent to a neural net for training (e.g. for training face detectors). I found the tf.extract_image_patches method in TensorFlow, which seemed like exactly what I need. This question explains what it does.

The example there shows an input of shape (1x10x10x1) (the numbers 1 through 100 in order), with ksizes of (1, 3, 3, 1) and strides of (1, 5, 5, 1). The output is this:

 [[[[ 1  2  3 11 12 13 21 22 23]
    [ 6  7  8 16 17 18 26 27 28]]

   [[51 52 53 61 62 63 71 72 73]
    [56 57 58 66 67 68 76 77 78]]]]

But I'd expect windows like this (of shape (Nx3x3x1), i.e. N patches/windows of size 3x3):

[[[1, 2, 3]
  [11, 12, 13]
  [21, 22, 23]]
    ...

So why are all patch values stored in 1D? Does it mean that this method is not meant for the purposes I described above and I can't use it to prepare batches for training? I also found another method for extracting patches, sklearn.feature_extraction.image.extract_patches_2d, and this one really does what I was expecting. So should I understand that these two methods don't do the same thing?

T.Poe

1 Answer


Correct, these functions return different tensors (multi-dimensional arrays).

First, the tf.extract_image_patches documentation reads:

Returns:

A Tensor. Has the same type as images. 4-D Tensor with shape [batch, out_rows, out_cols, ksize_rows * ksize_cols * depth] containing image patches with size ksize_rows x ksize_cols x depth vectorized in the "depth" dimension. Note out_rows and out_cols are the dimensions of the output patches.

Basically, this says that the windows [1, 2, 3], [11, 12, 13], [21, 22, 23] are flattened, or vectorized, into the "depth" dimension. out_rows and out_cols are computed from the strides argument, which in this case is strides=[1, 5, 5, 1], and from the padding, which is 'VALID'. As a result, the output shape is (1, 2, 2, 9).

In other words:

  • strides determines the spatial dimensions (out_rows, out_cols)
  • ksizes determines the depth (ksize_rows * ksize_cols * depth)

Note that the output tensor still contains all of the individual windows, so you can recover them by indexing or reshaping, as the sketch below shows.
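
A minimal sketch of the above (using the TensorFlow 1.x session API that was current at the time; in TensorFlow 2.x the same op is available as tf.image.extract_patches):

    import numpy as np
    import tensorflow as tf

    # A 1x10x10x1 "image" holding the numbers 1 through 100, as in the example.
    images = np.arange(1, 101, dtype=np.float32).reshape(1, 10, 10, 1)

    patches = tf.extract_image_patches(
        images,
        ksizes=[1, 3, 3, 1],
        strides=[1, 5, 5, 1],
        rates=[1, 1, 1, 1],  # no dilation
        padding='VALID')

    with tf.Session() as sess:
        out = sess.run(patches)

    print(out.shape)  # (1, 2, 2, 9): a 2x2 grid of patches, each flattened to 9 values

    # Each 9-vector folds back into a 3x3 window, giving the Nx3x3x1 shape
    # from the question (for depth > 1, reshape to (-1, 3, 3, depth), since
    # the values are vectorized in row x col x depth order):
    windows = out.reshape(-1, 3, 3, 1)
    print(windows[0, :, :, 0])
    # [[ 1.  2.  3.]
    #  [11. 12. 13.]
    #  [21. 22. 23.]]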


On the other hand, sklearn.feature_extraction.image.extract_patches_2d:

Returns:

patches : array, shape = (n_patches, patch_height, patch_width) or (n_patches, patch_height, patch_width, n_channels) The collection of patches extracted from the image, where n_patches is either max_patches or the total number of patches that can be extracted.

This is exactly what you describe: each window keeps its full spatial dimensions patch_height, patch_width. Here the result shape depends only on patch_size (striding and padding are not supported), and the first dimension is the total number of patches.
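
A comparable sketch on the sklearn side (note that extract_patches_2d takes a single 2-D image rather than a batched 4-D tensor, and always uses a stride of 1):

    import numpy as np
    from sklearn.feature_extraction.image import extract_patches_2d

    image = np.arange(1, 101).reshape(10, 10)  # the same 10x10 image

    patches = extract_patches_2d(image, patch_size=(3, 3))
    print(patches.shape)  # (64, 3, 3): every possible 3x3 window
    print(patches[0])
    # [[ 1  2  3]
    #  [11 12 13]
    #  [21 22 23]]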

Maxim
  • So what is the TensorFlow method for? I imagined that "patch extraction" is exactly what the sklearn method does and I'd use it, but I need striding... – T.Poe Nov 11 '17 at 15:22