
I'm trying to solve a multi-linear-regression problem with a very simple linear network. The network consists of only a single dense layer as its output layer, and its activation function is set to linear. I synthesize the output data Y by multiplying the input data X by the system (weight) matrix A: Y = A.X. Both X and A contain random numbers with normal or uniform distributions (the problem happens regardless). In this case, the network reaches above 99% accuracy in only 7 epochs over 1000 samples, as one would expect.

Now, if I instead synthesize X from Y using A's inverse, X = inv(A).Y, where Y this time contains uniform random numbers, and try to train the network, the accuracy only reaches 94% after two hundred epochs.

Why is there such a huge disparity between the two cases even though the system matrix (the weights) is exactly the same? The only difference is the random distribution of X and Y. If I'm forced to follow the second case, how can I improve the trainability of my network so that it can be trained in a few epochs?

I have tried different optimizers, initializers and regularizers, but they didn't help.

Here's the code for the version that doesn't converge so well. To get the first version, I pass gen1 instead of gen2 to Dataset.from_generator(...).

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras

N = 256
np.random.seed(0)
A = np.random.normal(0,.4,(N,N))
Ainv = np.linalg.inv(A)

import itertools

input_size = N
def gen1():
    for i in itertools.count(1):
        X = np.random.rand(N,1)-.5
        Y = np.dot(A,X)
        yield (X[:,0],Y[:,0])

def gen2():
    for i in itertools.count(1):
        Y = np.random.rand(N,1)-0.5
        X = np.dot(Ainv,Y)
        yield (X[:,0],Y[:,0])

dataset = tf.data.Dataset.from_generator(
     gen2,
     (tf.float64, tf.float64),
     (tf.TensorShape([N]), tf.TensorShape([N])))

train_ds = dataset.take(950)
valid_ds = dataset.skip(950).take(50)

#train_ds = train_ds.shuffle(2000, reshuffle_each_iteration = True)

train_ds = train_ds.batch(1)
valid_ds = valid_ds.batch(1)

from keras.layers import Input, Dense
from keras.models import Model
from keras import backend
 
def rabs(y_t, y_p):
    return backend.mean(backend.abs(y_p - y_t), axis=-1)/(tf.keras.backend.max(y_t) - tf.keras.backend.min(y_t))*100

inp = Input(shape=(input_size,))
out = Dense(N, activation='linear')(inp)

autoencoder = Model(inp, out)

#opt = tf.keras.optimizers.Adam(learning_rate=.0001)
opt = tf.keras.optimizers.SGD(learning_rate=.2, momentum=0.7)
autoencoder.compile(optimizer= opt,
              loss=tf.keras.losses.MeanSquaredError(),metrics= [rabs])

autoencoder.summary()

autoen_model = autoencoder.fit(train_ds, validation_data = valid_ds, epochs = 200)

plt.plot(autoen_model.history['rabs'])
plt.plot(autoen_model.history['val_rabs'])
plt.title('Model Accuracy')
plt.ylabel('Relative Absolute Mean Error %')
plt.xlabel('Epoch')
plt.legend(['Training set', 'Validation set'], loc='upper left')
plt.show()

Training graphs

[Plot: Case 1 (Y synthesized), training and validation error curves]

[Plot: Case 2 (X synthesized), training and validation error curves]

Kami

2 Answers


Why I think this is happening

I'm going to ignore that you're doing stochastic gradient descent and just imagine that you're working with the entire dataset for each step. In that setting, the problem in both cases is to minimize ||Y-AX||^2 over A.

After doing some algebra, you can write this as a quadratic optimization problem of the form

\min_{z} z^T Q z + b^T z,

where z \in R^{256^2} represents the entries of the matrix A, Q is a symmetric matrix obtained only from X, and b is a vector obtained from X and Y. What you are asking Tensorflow to do is to solve this problem using gradient descent.
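
For reference, here is a sketch of that algebra (my own addition, using the column-major vec convention, z = vec(A), and dropping constants that do not depend on z):

||Y - AX||_F^2 = ||vec(Y) - (X^T \otimes I) z||^2 = z^T (X X^T \otimes I) z - 2 vec(Y)^T (X^T \otimes I) z + const,

so Q = X X^T \otimes I and b = -2 (X \otimes I) vec(Y).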

The convergence rate of gradient descent on this type of problem is governed by the condition number of Q, which is its largest eigenvalue divided by its smallest. A condition number that is much larger than one leads to slow gradient descent, as some variables update much faster than others. A condition number closer to one is best for obtaining fast convergence. In Guler's Foundations of Optimization (Section 14.2) you can read more about the effect of condition number on convergence of (a variant of) gradient descent, though there are probably better resources on this out there.
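
As a toy illustration of that effect (my own sketch, not part of the original experiment; the helper gd_steps_to_converge is just an illustrative name), plain gradient descent on a two-variable quadratic needs roughly condition-number-many more iterations as the conditioning worsens:

import numpy as np

def gd_steps_to_converge(kappa, tol=1e-6, max_iter=200000):
    # Gradient descent on f(z) = 0.5 * z^T Q z with Q = diag(1, kappa),
    # using step size 1/kappa (roughly the largest stable step).
    # Count the iterations until ||z|| drops below tol.
    Q = np.diag([1.0, kappa])
    lr = 1.0 / kappa
    z = np.array([1.0, 1.0])
    for it in range(max_iter):
        if np.linalg.norm(z) < tol:
            return it
        z = z - lr * (Q @ z)   # gradient of 0.5 * z^T Q z is Q z
    return max_iter

for kappa in [1, 10, 100, 1000]:
    print(kappa, gd_steps_to_converge(kappa))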

In your case, the eigenvalues of Q are just the eigenvalues of XX^T, which are the squared singular values of X. For the first dataset, X is uniformly distributed, and in the second X = A_0^{-1} Y, where Y is uniformly distributed.

The difference in convergence you are observing comes from the fact that multiplication by A_0^{-1} wildly increases the condition number of your data matrix. In the following Python code I ran some random trials of this and found that the condition number of the second matrix is far larger, thousands of times larger.

import numpy as np

cond1 = []
cond2 = []


for i in range(10):
    A = np.random.normal(0, 0.4, (256, 256))
    Ainv = np.linalg.inv(A)

    # Case 1: X drawn directly from a uniform distribution
    X1 = np.random.rand(256, 950)
    X1sv = np.linalg.svd(X1, compute_uv=False)

    # Case 2: Y uniform, X obtained through A's inverse
    Y = np.random.rand(256, 950)
    X2 = np.dot(Ainv, Y)
    X2sv = np.linalg.svd(X2, compute_uv=False)

    # condition number of X X^T = (largest / smallest singular value of X)^2
    cond1.append((X1sv.max() / X1sv.min())**2)
    cond2.append((X2sv.max() / X2sv.min())**2)
cond1 = np.array(cond1)
cond2 = np.array(cond2)

print('X1\'s condition number has mean {:.2f} and std {:.2f} '.format(cond1.mean(), cond1.std()))
print('X2\'s condition number has mean {:.2f} and std {:.2f} '.format(cond2.mean(), cond2.std()))
print('X2\'s mean condition number is {:.1f} times as big as X1\'s'.format(cond2.mean()/cond1.mean()))

So that's my guess as to why you're seeing worse convergence for the second case than the first. I could be wrong, but maybe this will point you in the right direction.

Suggested Solutions

There are a couple of solutions to this:

  1. Use an optimization algorithm like Adam or RMSprop, which adapts per-parameter step sizes and so partially compensates for the ill-conditioning. You can learn more about those in Chapter 8 of https://www.deeplearningbook.org/.
  2. Do you need to have A be a Gaussian matrix? A matrix with eigenvalues closer to 1 would reduce this problem.
  3. There are optimization techniques (nothing to do with machine learning) that ameliorate the difficulties of a large condition number. You might look up preconditioned gradient descent for more information on this; a rough sketch follows this list.
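
Here is a minimal numpy sketch of that last idea, outside of Keras (the sizes N and S, the eps term, and the choice of P = (X X^T)^{-1} as preconditioner are illustrative assumptions, not part of the original setup):

import numpy as np

# Preconditioned gradient descent for min_A ||Y - A X||_F^2 on the
# badly conditioned case (X built from A_true's inverse).
np.random.seed(0)
N, S = 64, 400                          # smaller than the question's 256 x 950, for speed
A_true = np.random.normal(0, 0.4, (N, N))
Y = np.random.rand(N, S) - 0.5
X = np.linalg.inv(A_true) @ Y

# Right preconditioner ~ (X X^T)^{-1}; the eps guards against numerical issues.
P = np.linalg.inv(X @ X.T + 1e-8 * np.eye(N))

A = np.zeros((N, N))
for step in range(5):
    grad = (A @ X - Y) @ X.T            # gradient of 0.5 * ||Y - A X||_F^2
    A = A - grad @ P                    # preconditioned step, learning rate 1
    print(step, np.linalg.norm(A - A_true) / np.linalg.norm(A_true))

With this preconditioner and the full batch, the very first step already lands on the least-squares solution Y X^T (X X^T)^{-1}, up to numerical error, which is exactly what the ill-conditioned gradient descent run struggles to reach.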
tmilne

I don't think there is anything wrong with the optimization process; I think the problem is your misleading metric rabs(y_t, y_p).

For the outputs of rabs(y_t, y_p) to be comparable after the MAE is divided by (backend.max(y_t) - backend.min(y_t)), the Y from gen1 and the Y from gen2 would need to follow the same probability distribution, which is not the case here: in gen1 your Y = np.dot(A, np.random.rand(N,1) - 0.5), while in gen2 Y = np.random.rand(N,1) - 0.5.

A simple example: consider y_true_1 = (0.1, 0.2, 0.3) and y_true_2 = (0.1, 0.2, 0.5), with y_pred_1 = (0.0, 0.1, 0.2) and y_pred_2 = (0.0, 0.1, 0.4). Then MAE_1 = MAE_2 = 0.1, but after dividing MAE_1 by (max(y_true_1) - min(y_true_1)) you get RMAE_1 = 0.5, while dividing MAE_2 by (max(y_true_2) - min(y_true_2)) gives RMAE_2 = 0.25. You can see now why, if the distribution of y_true_1 differs from the distribution of y_true_2, you cannot expect the two outputs of rabs(y_t, y_p) to be the same.
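
A quick numpy check of those numbers (just reproducing the example above):

import numpy as np

y_true_1 = np.array([0.1, 0.2, 0.3]); y_pred_1 = np.array([0.0, 0.1, 0.2])
y_true_2 = np.array([0.1, 0.2, 0.5]); y_pred_2 = np.array([0.0, 0.1, 0.4])

mae_1 = np.mean(np.abs(y_pred_1 - y_true_1))          # 0.1
mae_2 = np.mean(np.abs(y_pred_2 - y_true_2))          # 0.1
rmae_1 = mae_1 / (y_true_1.max() - y_true_1.min())    # 0.1 / 0.2 = 0.5
rmae_2 = mae_2 / (y_true_2.max() - y_true_2.min())    # 0.1 / 0.4 = 0.25
print(mae_1, mae_2, rmae_1, rmae_2)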

I changed rabs(y_t, y_p) to plain MAE:

def rabs(y_t, y_p):
    return backend.mean(backend.abs(y_p - y_t))

And the optimizer to:

learning_rate_fn = tf.keras.optimizers.schedules.InverseTimeDecay(1.0, 950 * 100, 9)
opt = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)

I ran it many times with epochs = 100; the outputs for both gen1() and gen2() are around:

gen1:
Epoch 1/100
950/950 [==============================] - 1s 625us/step - loss: 1631.5898 - rabs: 31.9912 - val_loss: 1568.4200 - val_rabs: 31.6044
Epoch 100/100
950/950 [==============================] - 1s 541us/step - loss: 16.1436 - rabs: 3.1877 - val_loss: 19.1974 - val_rabs: 3.5311

gen2:
Epoch 1/100
950/950 [==============================] - 1s 614us/step - loss: 51.9863 - rabs: 5.7896 - val_loss: 20.9347 - val_rabs: 3.5948
Epoch 100/100
950/950 [==============================] - 1s 540us/step - loss: 0.7340 - rabs: 0.6716 - val_loss: 0.5478 - val_rabs: 0.5920

As you can see, the optimizer basically does the same job in both cases: it reduces the loss (MSE) by a factor of about 100 and rabs (MAE) by a factor of about 10.

Mr. For Example
  • I used the relative absolute mean error because it's more closely related to the weight errors. In the first case, the output varies between -100 and 100, while in the second it's between 0 and 1. Even if you get the same absolute error for both, the weights are tens of times more accurate in case 1. Using your code, the MSE for the weight matrix is 0.5 and 19 for the first and second cases respectively. (The weights should be equal to Ainv in both cases, and their values range between -10 and 15.) I think the problem arises from the collinearity of the input data in the second case. – Kami Dec 14 '20 at 19:05
  • I'm not saying they should get the same absolute error for both; the final result is determined by how you initialize the weights and by the other hyperparameters (you cannot expect the same hyperparameters to give the same result, and my code above simply uses the same weight initialization process and the same learning rate schedule for both cases). I'm simply pointing out that using the relative absolute mean error when the outputs have different probability distributions is misleading, and that the optimizer basically makes the same progress from the initial weights in both cases. – Mr. For Example Dec 15 '20 at 01:22
  • So, in short, if you want both cases to reach approximately the same result (MAE at a different scale), you should try two different weight initialization processes for the `Dense` layer and different `learning rate`s. Good luck finding good hyperparameters. – Mr. For Example Dec 15 '20 at 01:42
  • This is a purely linear system, and the fit that you obtained in the second case has 67% error. Do you not expect a better result after 100 epochs, when you know the network should perfectly match the original system because they are identical (linear, same connections and same number of parameters)? There is an analytical, closed-form solution for such regression problems. Or look at it this way: if you have a matrix A, can you find the inverse of A with better than, say, 1% accuracy using the network and data I presented? (Yours has 67% error; mine has 7%, which is still not good.) – Kami Dec 15 '20 at 20:55
  • The first case doesn't have this problem and easily achieves below 0.01% error with any optimizer, but that's because the input data are independent and the generator literally uses Ainv. If you had used my optimizer, or even the default Adam, you would get 0.0013 and 0.1 MAE for the 1st and 2nd cases in 100 epochs (roughly 0.0005% vs. 10% relative error). They are not the same, whether you look at the relative or the absolute error. This is really more of a math and multicollinearity problem. – Kami Dec 15 '20 at 20:56
  • Let me say it again: if you want both cases to reach approximately the same result (MAE at a different scale), you should try two different weight initialization processes for the `Dense` layer and different `learning rate`s, especially different initial weights in the `Dense` layer; you cannot expect both cases to start from the same point and reach the same result. But even then the same result is not guaranteed, because X and Y in the two cases do not follow the same distribution, so their loss landscapes may differ; as far as I know there is no mathematical proof of what you want from the optimizer. – Mr. For Example Dec 16 '20 at 01:32
  • Check [DeepMind on optimizers](https://www.youtube.com/watch?v=kVU8zTI-Od0) if you want to learn more about optimization. – Mr. For Example Dec 16 '20 at 01:32