
I am a newbie in machine learning and am learning the basic concepts of regression. The confusion I have can be best explained with an example of input samples and their target values. So, for example (note that the example below is a general case; I observed this behaviour of performance and predicted values on a large custom dataset of images. Also note that the target values are not floats), I have:

xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]

and

xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]

As you can notice, every three samples (every two in the test set) share the same target value. Suppose I have a multi-layer perceptron network with one Flatten() and two Dense() layers. After training, the network predicts the same target value for all test samples:

yPredicted = [40, 40, 40, 40]

Because the predicted values are all the same, computing the correlation between ytest and yPredicted returns null and gives an error.
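This failure is easy to reproduce; a minimal sketch with NumPy, assuming the correlation in question is Pearson's r:

```python
import numpy as np

ytest = np.array([25, 25, 35, 35], dtype=float)
ypred = np.array([40, 40, 40, 40], dtype=float)

# Pearson correlation divides by the standard deviation of each array;
# a constant prediction vector has zero standard deviation, so the
# result is undefined (NaN), typically with a runtime warning.
with np.errstate(invalid="ignore"):
    r = np.corrcoef(ytest, ypred)[0, 1]

print(r)  # nan
```

Any correlation routine based on Pearson's formula (e.g. scipy.stats.pearsonr) fails the same way on constant input.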

But when I have:

xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [332, 433, 456, 675, 234, 879, 242, 634, 789, 432, 897, 982]

And:

xtest = [13, 14, 15, 16]
ytest = [985, 341, 354, 326]

The predicted values are:

yPredicted = [987, 345, 435, 232]

which gives a very good correlation.

My question is: what is it in a machine learning algorithm that makes the learning better when each input has a distinct target value? Why does the network not work when a large number of inputs have repeated target values?

desertnaut
sana

1 Answer


Why does the network not work when a large number of inputs have repeated target values?

Most certainly, this is not the reason why your network does not perform well on the first dataset shown.

(You have not provided any code, so inevitably this will be a qualitative answer)

Looking closely at your first dataset:

xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]

it's not difficult to conclude that we have a monotonic (increasing) function y(x) (it is not strictly monotonic, but it is monotonic nevertheless over the whole x range provided).

Given that, your model has absolutely no way of "knowing" that, for x > 12, the qualitative nature of the function changes significantly (and rather abruptly), as apparent from your test set:

xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]

and you should not expect it to know or "guess" this in any way (despite what many people seem to believe, NNs are not magic).
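The break in the trend can be made explicit with two quick checks (a sketch in NumPy using the values from the question):

```python
import numpy as np

ytrain = np.array([10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40])
ytest = np.array([25, 25, 35, 35])

# The training targets never decrease as x grows: y(x) is monotonic
# (non-strictly increasing) over the whole training range.
train_monotonic = bool(np.all(np.diff(ytrain) >= 0))  # True

# Yet every test target sits below the last training value (40), so the
# trend breaks abruptly exactly where the training data ends.
trend_breaks = bool(np.all(ytest < ytrain[-1]))  # True
```

A model fit on the training targets alone has no evidence whatsoever of this break.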

Looking closely at your second dataset, you will see that this is not the case there, hence the network is unsurprisingly able to perform better. When doing such experiments, it is very important to be sure that we are comparing apples to apples, and not apples to oranges.

Another general issue with your attempts here is the following: neural nets are not good at extrapolation, i.e. at predicting numerical functions outside the numeric domain on which they have been trained. For details, please see my own answer at Is deep learning bad at fitting simple non linear functions outside training scope?
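Since no code was posted, here is a hypothetical stand-in for the described setup: a small scikit-learn MLPRegressor (not the original Keras model) trained on x = 1..12 and asked to predict at x = 13..16, which lies entirely outside the training range:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Training data from the question: x inside [1, 12].
xtrain = np.arange(1, 13, dtype=float).reshape(-1, 1)
ytrain = np.array([10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40], dtype=float)

# Hypothetical small MLP; the exact architecture is not the point,
# any such regressor shows the same extrapolation limitation.
model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000, random_state=0)
model.fit(xtrain, ytrain)

# x = 13..16 is outside [1, 12]: the model must extrapolate, and it can
# only continue the trend it saw, not anticipate the drop to 25/35.
xtest = np.arange(13, 17, dtype=float).reshape(-1, 1)
ypred = model.predict(xtest)
```

Whatever the exact predicted numbers, nothing in the training data can steer them toward the test targets [25, 25, 35, 35].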

One last unusual thing here is your use of correlation; I am not sure why you chose to do this, but you may be interested to know that, in practice, we never assess model performance with a correlation measure between predicted outcomes and ground truth - we use measures such as the mean squared error (MSE) instead (for regression problems such as yours).
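Unlike Pearson correlation, MSE stays well-defined even when every prediction is identical; a minimal sketch with the first test set:

```python
import numpy as np

ytest = np.array([25, 25, 35, 35], dtype=float)
ypred = np.array([40, 40, 40, 40], dtype=float)

# MSE needs no variance in the predictions, so constant output is no
# problem: errors are (-15, -15, -5, -5), squared (225, 225, 25, 25).
mse = float(np.mean((ytest - ypred) ** 2))  # 125.0
```

The same value is returned by sklearn.metrics.mean_squared_error(ytest, ypred).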

desertnaut