I want to divide my images into smaller windows which will be sent to a neural net for training (e.g. for training face detectors). I found the tf.extract_image_patches method in TensorFlow, which seemed like exactly what I need. This question explains what it does.

The example there shows an input of shape (1x10x10x1) (the numbers 1 through 100 in order), with ksizes of (1, 3, 3, 1) and strides of (1, 5, 5, 1). The output is this:

 [[[[ 1  2  3 11 12 13 21 22 23]
    [ 6  7  8 16 17 18 26 27 28]]

   [[51 52 53 61 62 63 71 72 73]
    [56 57 58 66 67 68 76 77 78]]]]

But I'd expect windows like this (of shape (Nx3x3x1), i.e. N patches/windows of size 3x3):

[[[1, 2, 3]
  [11, 12, 13]
  [21, 22, 23]]
    ...

So why are all patch values stored in 1D? Does it mean that this method is not meant for the purposes I described above and I can't use it to prepare batches for training? I also found another method for extracting patches, sklearn.feature_extraction.image.extract_patches_2d, and this one really does what I was expecting. So should I understand that these two methods don't do the same thing?

T.Poe

1 Answer


Correct, these functions return different tensors (multi-dimensional arrays).

First, the tf.extract_image_patches documentation reads:

Returns:

A Tensor. Has the same type as images. 4-D Tensor with shape [batch, out_rows, out_cols, ksize_rows * ksize_cols * depth] containing image patches with size ksize_rows x ksize_cols x depth vectorized in the "depth" dimension. Note out_rows and out_cols are the dimensions of the output patches.

Basically, this says that the windows [1, 2, 3], [11, 12, 13], [21, 22, 23] are flattened, or vectorized, into the "depth" dimension. out_rows and out_cols are computed from the strides argument, which in this case is strides=[1, 5, 5, 1], and from the padding, which is 'VALID'. As a result, the output shape is (1, 2, 2, 9).

In other words:

  • strides determines the spatial dimensions (out_rows, out_cols)
  • ksizes determines the depth (ksize_rows * ksize_cols * depth)

Note that the output tensor still contains all of the individual windows, so you can recover them by indexing or reshaping, as the sketch below shows.
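
A minimal sketch of the above (using the TensorFlow 1.x session API that was current at the time; in TensorFlow 2.x the same op is available as tf.image.extract_patches):

    import numpy as np
    import tensorflow as tf

    # A 1x10x10x1 "image" holding the numbers 1 through 100, as in the example.
    images = np.arange(1, 101, dtype=np.float32).reshape(1, 10, 10, 1)

    patches = tf.extract_image_patches(
        images,
        ksizes=[1, 3, 3, 1],
        strides=[1, 5, 5, 1],
        rates=[1, 1, 1, 1],  # no dilation
        padding='VALID')

    with tf.Session() as sess:
        out = sess.run(patches)

    print(out.shape)  # (1, 2, 2, 9): a 2x2 grid of patches, each flattened to 9 values

    # Each 9-vector folds back into a 3x3 window, giving the Nx3x3x1 shape
    # from the question (for depth > 1, reshape to (-1, 3, 3, depth), since
    # the values are vectorized in row x col x depth order):
    windows = out.reshape(-1, 3, 3, 1)
    print(windows[0, :, :, 0])
    # [[ 1.  2.  3.]
    #  [11. 12. 13.]
    #  [21. 22. 23.]]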


On the other hand, sklearn.feature_extraction.image.extract_patches_2d:

Returns:

patches : array, shape = (n_patches, patch_height, patch_width) or (n_patches, patch_height, patch_width, n_channels) The collection of patches extracted from the image, where n_patches is either max_patches or the total number of patches that can be extracted.

This is exactly what you describe: each window keeps its full spatial dimensions patch_height, patch_width. Here the result shape depends only on patch_size (striding and padding are not supported), and the first dimension is the total number of patches.
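
A comparable sketch on the sklearn side (note that extract_patches_2d takes a single 2-D image rather than a batched 4-D tensor, and always uses a stride of 1):

    import numpy as np
    from sklearn.feature_extraction.image import extract_patches_2d

    image = np.arange(1, 101).reshape(10, 10)  # the same 10x10 image

    patches = extract_patches_2d(image, patch_size=(3, 3))
    print(patches.shape)  # (64, 3, 3): every possible 3x3 window
    print(patches[0])
    # [[ 1  2  3]
    #  [11 12 13]
    #  [21 22 23]]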

Maxim
  • So what is the TensorFlow method for? I imagined that "patch extraction" is exactly what the sklearn method does and I'd use it, but I need striding... – T.Poe Nov 11 '17 at 15:22