
I would like to classify the pixels of an image as "street" or "not street". I have some training data from the KITTI dataset, and I have seen that Caffe has an IMAGE_DATA layer type. The labels are given as images of the same size as the input image.

Besides Caffe, my first idea for solving this problem was to feed in image patches around each pixel to be classified (e.g. 20 pixels to the top / left / right / bottom, resulting in 41×41 = 1681 features per pixel).
However, if I could tell Caffe how to use the labels without having to create those image patches manually (and the layer type IMAGE_DATA seems to suggest that this is possible), I would prefer that.
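The manual patch approach described above could be sketched as follows (illustrative Python with NumPy; `extract_patch` is a hypothetical helper, not part of Caffe or any KITTI tooling). Border pixels are handled here by zero-padding, which is only one possible choice:

```python
import numpy as np

def extract_patch(image, row, col, radius=20):
    """Extract a (2*radius+1) x (2*radius+1) patch centered on (row, col).

    The image is zero-padded so that patches near the border are
    well defined.
    """
    padded = np.pad(image, radius, mode="constant")
    # After padding, pixel (row, col) moves to (row + radius, col + radius).
    r, c = row + radius, col + radius
    return padded[r - radius : r + radius + 1, c - radius : c + radius + 1]

img = np.arange(100).reshape(10, 10)
patch = extract_patch(img, 0, 0)
print(patch.shape)  # (41, 41)
```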

Can Caffe classify pixels of an image directly? What would such a prototxt network definition look like? How do I give Caffe the information about the labels?

I guess the input layer would be something like

layers {
  name: "data"
  type: IMAGE_DATA
  top: "data"
  top: "label"
  image_data_param {
    source: "path/to/file_list.txt"
    mean_file: "path/to/imagenet_mean.binaryproto"
    batch_size: 4
    crop_size: 41
    mirror: false
    new_height: 256
    new_width: 256
  }
}

However, I am not sure what crop_size exactly means. Is the crop really centered? How does Caffe deal with the corner pixels? And what are new_height and new_width good for?

Martin Thoma
  • your question is very big in the sense that it touches many subjects. Can you break it into smaller questions, one topic per question? You can (and should?) link the questions to give context. – Shai May 13 '15 at 06:05
  • See also: [Question on Google Groups](https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/caffe-users/AjcfGsxpWrc/lu4YBhWrwA0J) – Martin Thoma May 13 '15 at 10:12

2 Answers


Can Caffe classify pixels? In theory, I think the answer is yes. I didn't try it myself, but I don't think there is anything stopping you from doing so.

Inputs:
You need two "ImageData" layers: one that loads the RGB image and another that loads the corresponding label-mask image. Note that if you use the convert_imageset utility, you must not shuffle each set independently; otherwise you won't be able to match an image to its label-mask.

An "ImageData" layer has two "top"s: one for "data" and one for "label". I suggest you set the "label" of both input layers to the index of the image/label-mask, and add a utility layer that verifies that the indices always match. This will prevent you from training on the wrong label-masks ;)

Example:

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "data-idx"
  # parameters...
}
layer {
  name: "label-mask"
  type: "ImageData"
  top: "label-mask"
  top: "label-idx"
  # parameters...
}
layer {
  name: "assert-idx"
  type: "EuclideanLoss"
  bottom: "data-idx"
  bottom: "label-idx"
  top: "this-must-always-be-zero"
}

Loss layer:
Now, you can do whatever you like to the input data, but eventually, to get a pixel-wise labeling, you need a pixel-wise loss. Therefore, your last layer (before the loss) must produce a prediction with the same width and height as "label-mask". Not all loss layers know how to handle multiple labels, but "EuclideanLoss" (for example) can, so you could have a loss layer like:

layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "prediction" # same width and height as the image
  bottom: "label-mask"
  top: "loss"
}

I think "SoftmaxWithLoss" has a newer version that can be used in this scenario, but you'll have to check it out yourself. In that case, "prediction" should be of shape 2-by-h-by-w (since you have 2 labels).
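A hedged sketch of what such a loss might look like, assuming the spatial version of "SoftmaxWithLoss" accepts a 2-channel prediction against a single-channel, integer-valued label-mask (blob names here are illustrative; verify against your Caffe version):

```
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "prediction"  # shape: N x 2 x H x W (one channel per class)
  bottom: "label-mask"  # shape: N x 1 x H x W, integer labels in {0, 1}
  top: "loss"
  loss_param {
    # optionally skip unlabeled pixels, e.g.
    # ignore_label: 255
  }
}
```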

Additional notes:
Once you set the input size in the parameters of the "ImageData" layer, you fix the sizes of all blobs in the net. You must set the label size to match, and you must carefully consider how you are going to deal with images of different shapes and sizes.
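For reference, a sketch of how new_height / new_width could be used to force all inputs to one fixed size (paths and values are placeholders). One caveat: Caffe resizes with interpolation, which can blur label values at class boundaries, so pre-resizing label-masks offline with nearest-neighbor interpolation may be safer:

```
layer {
  name: "label-mask"
  type: "ImageData"
  top: "label-mask"
  top: "label-idx"
  image_data_param {
    source: "path/to/label_list.txt"
    batch_size: 4
    is_color: false   # load masks as single-channel
    new_height: 256   # resize every mask to a fixed size so that
    new_width: 256    # all blobs in the net share one shape
  }
}
```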

Shai
  • I tried to address the main issues raised in your question, regarding the details of the parameters of `IMAGE_DATA` layer - please ask a different specific question about them. – Shai May 13 '15 at 06:25
  • Could you explain more specifically why the shape has to be 2-by-h-by-w? As far as I have understood, the EuclideanLoss input has to have the same dimensions as the label, i.e. if the label is a grayscale image there would be only 1 channel, and therefore the prediction would have to be of shape 1-by-h-by-w? –  Nov 11 '16 at 00:20
  • What would be the `num_output` in the last convolutional layer or are you using a `fully connected layer` and reshape the output accordingly? @Shai @Martin Thoma –  Nov 15 '16 at 20:36
  • @thigi if you are using a `"Convolution"` layer, then `num_output` should equal the number of labels. If you are using `"InnerProduct"` param you would have to `"Reshape"` your prediction to get the proper shape for the loss layer. – Shai Nov 15 '16 at 20:42
  • If I use EuclideanLoss the num_output has to be the same as the number of labels as well? Would you reshape after or before the loss layer? @Shai –  Nov 15 '16 at 20:52
  • Have a look at that question if you do not understand what I mean: [link](http://stackoverflow.com/questions/40588551/caffe-how-to-convert-network-from-pixel-wise-segmentation-to-pixel-wise-regress) @Shai –  Nov 15 '16 at 20:59

It seems you can try fully convolutional networks (FCN) for semantic segmentation.

The paper is listed among Caffe's publications: https://github.com/BVLC/caffe/wiki/Publications

Also here is the model: https://github.com/BVLC/caffe/wiki/Model-Zoo#fully-convolutional-semantic-segmentation-models-fcn-xs

Also, this presentation can be helpful: http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-pixels.pdf

mrgloom