I don't have answers for you, but I can offer some observations.
If I understand your setup correctly, you have a separate loss function for each output y_i, and each loss is a regression loss forcing y_i into a particular range.
1. Since your outputs are "pulling" toward different ranges, the rows of the last layer's weight matrix can end up with very different scales. If you are using a regularizer (like L2), this may "confuse" the learning process, which tries to keep the weights roughly isotropic.
To overcome this, you can either relax the regularization on the last layer's weights (using the decay_mult parameter), or add a "Scale" layer that learns only a per-output scale factor (and possibly a bias as well); see the first prototxt sketch after this list.
2. I don't understand what you are trying to accomplish by this. Are you trying to bound the outputs? You can get bounded outputs by applying a "Sigmoid" or "Tanh" activation to each output, forcing it into the [0..1] or [-1..1] range respectively. (You can add a "Scale" layer after the activation to stretch the range; see the second sketch below.)
3. You can use logistic regression for each of the outputs, or explore a smooth L1 loss, which should be more robust, especially if the targets are not in the range [-1..1] (a usage sketch is given below).
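
For item 1, a minimal prototxt sketch of both options. The layer/blob names (fc7, fc_out, scale_out) and num_output are placeholders, not from your net:

```
# Option A: last InnerProduct layer with weight decay switched off
layer {
  name: "fc_out"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc_out"
  param { lr_mult: 1 decay_mult: 0 }   # weights: learned, but no L2 penalty
  param { lr_mult: 2 decay_mult: 0 }   # bias: no L2 penalty
  inner_product_param {
    num_output: 5                      # number of outputs y_i
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}

# Option B: learn a per-output scale (and bias) on top of the predictions
layer {
  name: "scale_out"
  type: "Scale"
  bottom: "fc_out"
  top: "scale_out"
  scale_param { axis: 1 bias_term: true }
}
```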
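
For item 2, a sketch assuming the raw predictions come out of a blob named fc_out:

```
# Squash each output into [0..1]
layer {
  name: "sig_out"
  type: "Sigmoid"
  bottom: "fc_out"
  top: "sig_out"
}

# Optional learned per-output scale/bias to map [0..1] to the desired range
layer {
  name: "scaled_out"
  type: "Scale"
  bottom: "sig_out"
  top: "scaled_out"
  scale_param { axis: 1 bias_term: true }
}
```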
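
Regarding smooth L1 in item 3: note that the "SmoothL1Loss" layer is not in stock BVLC Caffe; it comes with the Fast R-CNN fork. If your build has it, usage looks roughly like this (blob names are again placeholders):

```
layer {
  name: "loss"
  type: "SmoothL1Loss"
  bottom: "scaled_out"
  bottom: "targets"      # ground-truth blob, same shape as the predictions
  top: "loss"
  loss_weight: 1
}
```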