I am a newbie in machine learning and am learning the basic concepts of regression. My confusion is best explained with an example of input samples and their target values. (Please note that the example below is a simplified, general case; I actually observed this behaviour in the performance and predicted values on a large custom dataset of images. Also note that the target values are not floats.) So, for example, I have:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]
and
xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]
As you can see, every three samples (every two in the test set) share the same target value. Suppose I have a multi-layer perceptron network with one Flatten() and two Dense() layers. After training, the network predicts the same value for all test samples:
yPredicted = [40, 40, 40, 40]
Because the predicted values are all the same, the correlation between ytest and yPredicted is undefined (the predictions have zero variance), so it returns null and raises an error.
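In case it helps, here is a minimal sketch of the kind of setup I mean (written in Keras; the layer sizes, number of epochs, and optimizer are just placeholders, not my actual configuration):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

xtrain = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=float).reshape(-1, 1)
ytrain = np.array([10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40], dtype=float)
xtest = np.array([13, 14, 15, 16], dtype=float).reshape(-1, 1)
ytest = np.array([25, 25, 35, 35], dtype=float)

# A small regressor: one Flatten() and two Dense() layers
model = keras.Sequential([
    keras.Input(shape=(1,)),
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(xtrain, ytrain, epochs=500, verbose=0)

# In my case the predictions come out (roughly) all the same,
# e.g. approximately [40, 40, 40, 40]
yPredicted = model.predict(xtest).ravel()
print(yPredicted)
```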
But when I have:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [332, 433, 456, 675, 234, 879, 242, 634, 789, 432, 897, 982]
And:
xtest = [13, 14, 15, 16]
ytest = [985, 341, 354, 326]
The predicted values are:
yPredicted = [987, 345, 435, 232]
which gives a very good correlation.
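For reference, the correlation check I run is essentially the following (using scipy.stats.pearsonr as an example; the numbers are the ones above):

```python
import numpy as np
from scipy.stats import pearsonr

# Second case: distinct targets, and the predictions track the true values
ytest = np.array([985, 341, 354, 326], dtype=float)
yPredicted = np.array([987, 345, 435, 232], dtype=float)
print(pearsonr(ytest, yPredicted)[0])  # close to 1

# First case: constant predictions have zero standard deviation, so
# Pearson's r is undefined: scipy returns nan and warns about constant
# input, which is the "null"/error I mentioned above.
ytest_first = np.array([25, 25, 35, 35], dtype=float)
yPred_first = np.array([40, 40, 40, 40], dtype=float)
print(pearsonr(ytest_first, yPred_first)[0])  # nan
```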
My question is: what is it in a machine learning algorithm that makes learning work better when each input has a distinct target value? Why does the network not work well when a large number of inputs share the same (repeated) target value?