Perhaps related to this question, but my goal is to have a network perform a manipulation on an input image and output the resulting image data.
In case this question lacks clarity, I'd be glad to go into more detail in the comments. However, I'll try to stay as case-agnostic as possible so this question is of use to others.
The Problem
I have a plethora of training data consisting of image pairs from before and after the proposed manipulation. My question is how I can train the network to learn this 1-to-1 pixel mapping using Caffe. My loss should take the form of something that computes the difference between the two images.
If my last fully-connected/inner-product layer outputs channels * height * width values, and I have a label holding the expected output image (same dimensions), what loss and accuracy structure should I use?
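For context, the kind of image-difference loss I have in mind is a per-pixel squared error. A minimal NumPy sketch (the function name `euclidean_loss` is mine; the formula matches what Caffe's `EuclideanLoss` layer documents: 1/(2N) times the summed squared difference, with N the batch size):

```python
import numpy as np

def euclidean_loss(pred, label):
    """Per-pixel squared-error loss in the style of Caffe's EuclideanLoss:
    loss = 1/(2N) * sum((pred - label)^2), where N is the batch size."""
    n = pred.shape[0]
    diff = pred - label
    return np.sum(diff ** 2) / (2.0 * n)

# Toy example: batch of 2 "images", each flattened to 3 * 4 * 4 = 48 values.
pred = np.zeros((2, 48))
label = np.ones((2, 48))
print(euclidean_loss(pred, label))  # sum of squares = 96, / (2 * 2) -> 24.0
```

Both blobs are flattened to shape N x (C*H*W) here, which is also the shape an inner-product layer would output.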
My Case
I've tried simply feeding the inner-product output, together with my label data, into a sigmoid cross-entropy loss, but it doesn't seem to be a supported setup.
My labels are non-integer values, since they are per-pixel RGB intensities between 0 and 1 (note: I could rescale them to integers from 0 to 255), and Caffe seems to interpret labels as categories rather than as plain values.
I could use 256 categories per pixel channel, but that would result in 256 * 3 channels * 256 height * 256 width = 50,331,648 outputs, which absurdly overcomplicates what I'm trying to achieve.
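For what it's worth, the sigmoid cross-entropy loss I attempted is mathematically well-defined for continuous targets in [0, 1], not only for 0/1 class labels. A NumPy sketch of that computation on raw (pre-sigmoid) scores, in the numerically stable form (the function name `sigmoid_xent` is mine, and the batch-size normalization is an assumption):

```python
import numpy as np

def sigmoid_xent(scores, targets):
    """Sigmoid cross-entropy on raw scores, summed and divided by batch size.
    Equivalent to -(t*log(s) + (1-t)*log(1-s)) with s = sigmoid(x), but
    written in a numerically stable form that avoids overflow in exp().
    Valid for any real-valued targets in [0, 1], not just 0/1 labels."""
    n = scores.shape[0]
    loss = (np.maximum(scores, 0) - scores * targets
            + np.log1p(np.exp(-np.abs(scores))))
    return np.sum(loss) / n

# A raw score of 0 with target 0.5 gives the maximum-entropy loss log(2).
print(sigmoid_xent(np.zeros((1, 1)), np.full((1, 1), 0.5)))  # ~0.6931
```

So the objection Caffe raises is about the layer/label plumbing, not about the math of fractional targets.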
My Questions
- Does Caffe natively support what I'm trying to achieve?
- If so, how should I change my structure to conform to this specification?
- If not, do any other neural network frameworks such as Torch support this?
- Is there a name for the type of problem I'm trying to solve with my network (certainly not categorical classification)?
- What has been used in the past to solve this style of problem?
Sources With Potential Value
- Training Super-Resolution Image Upscaling
- Karol Gregor on Variational Autoencoders and Image Generation (less relevant as this touches on model reconstruction)