
Suppose I have a neural network with a single output neuron. To outline the scenario: the network gets an image as input and should find a single object in that image. To keep things simple, it should output only the x-coordinate of the object.

However, since the object can be at various locations, the network's output will inevitably be noisy. Additionally, the image may be somewhat blurry or otherwise degraded.

Therefore, I thought it might be a better idea to have the network output a Gaussian distribution over the object's location.

Unfortunately, I am struggling to model this idea. How would I design the output? As a 100-dimensional vector if the image is 100 pixels wide, so that the network can fit a Gaussian distribution into this vector and I only need to locate the peak to get the approximate location of the object?

In addition, I cannot figure out the cost function and the teacher signal. Would the teacher signal be a perfect Gaussian distribution centered on the exact x-coordinate of the object? How should I model the cost function, then? Currently I use either softmax cross-entropy or a simple squared error between the network's output and the true x-coordinate.
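A minimal sketch of one way the teacher signal and cost could look, assuming a 100-pixel-wide image, a hand-picked target width `sigma`, and an all-zero `logits` vector standing in for the network's 100 raw outputs (all of these names and values are illustrative assumptions, not part of any particular framework). The target is a discretized Gaussian normalized to sum to 1, and the cost is softmax cross-entropy against that soft target:

```python
import numpy as np

def gaussian_target(x_true, width=100, sigma=2.0):
    """Discretized Gaussian over pixel columns, normalized to sum to 1.

    sigma is an assumed hyperparameter controlling how soft the target is.
    """
    xs = np.arange(width)
    t = np.exp(-0.5 * ((xs - x_true) / sigma) ** 2)
    return t / t.sum()

def cross_entropy(logits, target):
    """Softmax cross-entropy between raw outputs and a soft target."""
    z = logits - logits.max()              # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum())    # log-softmax
    return -(target * log_p).sum()

target = gaussian_target(x_true=42.3)      # teacher signal for an object at x = 42.3
logits = np.zeros(100)                     # placeholder for the network's output
loss = cross_entropy(logits, target)

# Two ways to read a location back out of a distribution over columns:
x_peak = int(np.argmax(target))                    # position of the peak
x_mean = float((np.arange(100) * target).sum())    # expected value (sub-pixel)
```

Reading the location from the expected value rather than the peak gives a sub-pixel estimate, and the spread of the predicted distribution can serve as a rough indication of how uncertain the network is.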

Is there perhaps a better way to handle this scenario, such as a different distribution, or some other way to have the network output more than a single value that carries no information about the noise?

daniel451
  • I'm voting to close this question as off-topic because you seem to be asking SO to propose and formulate the answer to an open-ended research topic. This is not a suitable question for StackOverflow. – pjs Jan 17 '16 at 20:05
  • Typically a Mixture Density Network is used for this problem: http://publications.aston.ac.uk/373/1/NCRG_94_004.pdf – QuadmasterXLII Sep 15 '18 at 03:34

1 Answer


Sounds like what you really need is a convolutional network.

You could train a network to recognize your target object when it's positioned in the center of the network's receptive field. You can then create a moving window, at each step feeding the portion of the larger image under that window into the net. If you keep track of the outputs of the trained network for each (x,y) position of the window, some locations of the window will produce better matches than others. Once you've covered the whole image, you can pick the position with the maximum network output as the position where the target object is most likely located.
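A minimal sketch of this moving-window search, assuming a 2D grayscale image stored as a NumPy array. The function `toy_score` is a hypothetical stand-in for the trained network (here just a template correlation); in practice it would be replaced by a forward pass of the network on the patch:

```python
import numpy as np

def sliding_window_argmax(image, window, score_fn, stride=1):
    """Slide a window over the image and return the center (x, y)
    of the window position with the highest score, plus that score."""
    h, w = image.shape
    wh, ww = window
    best_score, best_pos = -np.inf, None
    for y in range(0, h - wh + 1, stride):
        for x in range(0, w - ww + 1, stride):
            s = score_fn(image[y:y + wh, x:x + ww])
            if s > best_score:
                best_score = s
                best_pos = (x + ww // 2, y + wh // 2)
    return best_pos, best_score

# Hypothetical scorer standing in for the trained network's output.
template = np.ones((5, 5))
def toy_score(patch):
    return float((patch * template).sum())

# Toy image: a bright 5x5 object with its top-left corner at x=60, y=10.
image = np.zeros((40, 100))
image[10:15, 60:65] = 1.0

pos, score = sliding_window_argmax(image, (5, 5), toy_score)
```

Keeping the full grid of scores instead of only the maximum would give a response map over positions, which is close to the distribution over locations the question asks about.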

To handle scale and rotation variations, consider creating an image pyramid, i.e. a set of versions of the original image at different scales and rotations. Then search over those images to find the target object.
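A rough sketch of the scale part of such a pyramid, assuming a 2D grayscale NumPy image and simple 2x2 average pooling for downsampling (rotated copies are left out here; the helper names and `min_size` are illustrative choices):

```python
import numpy as np

def downsample2(image):
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    h, w = image.shape
    h2, w2 = h // 2 * 2, w // 2 * 2        # crop to even dimensions
    img = image[:h2, :w2]
    return img.reshape(h2 // 2, 2, w2 // 2, 2).mean(axis=(1, 3))

def image_pyramid(image, min_size=8):
    """Return a list of (scale_factor, image) pairs, coarsest last."""
    levels = [(1, image)]
    scale = 1
    while min(levels[-1][1].shape) // 2 >= min_size:
        scale *= 2
        levels.append((scale, downsample2(levels[-1][1])))
    return levels

pyramid = image_pyramid(np.random.rand(40, 100))
shapes = [lvl.shape for _, lvl in pyramid]
scales = [s for s, _ in pyramid]
```

Running the same window search on every level and multiplying the best position by that level's scale factor maps a detection back to coordinates in the original image.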

Justas