
I am trying to implement the standard gradient descent algorithm with PyTorch in order to perform dimensionality reduction (PCA) on the Indian Pines dataset. More specifically, I am trying to estimate the matrix U1 that minimizes ||X - (U1 @ U1.T) @ X||^2, where U1.T denotes the transpose of U1, @ denotes matrix multiplication, || . || denotes the Frobenius norm and X denotes the data (reconstruction error minimization).

For starters, I have vectorized the data, so the variable indian_pines has size torch.Size([220, 21025]), and I initialize U1 randomly with U1 = torch.rand(size=(220, 150), dtype=torch.float64, requires_grad=True).
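
For context, a minimal sketch of how the data could be loaded, vectorized and U1 initialized; the .mat file name and the 'indian_pines' key are assumptions, since the question does not show this step:

import scipy.io
import torch

# load the Indian Pines cube (145 x 145 pixels, 220 spectral bands);
# the file name and the .mat key below are assumptions
mat = scipy.io.loadmat('Indian_pines.mat')
cube = torch.tensor(mat['indian_pines'], dtype=torch.float64)  # shape (145, 145, 220)

# vectorize: one column per pixel -> shape (220, 21025)
indian_pines = cube.reshape(-1, cube.shape[-1]).T

# random initialization of the projection matrix (220 bands -> 150 components)
U1 = torch.rand(size=(220, 150), dtype=torch.float64, requires_grad=True)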

For the method itself, I have the following code:

n_iters = 100
learning_rate = 2e-9

for epoch in range(n_iters):

    # forward: project onto the 150-dim subspace and reconstruct;
    # since U1 @ U1.T is symmetric, this equals (U1 @ U1.T) @ indian_pines
    y_pred = torch.tensordot(U1 @ torch.t(U1), indian_pines, ([0], [0]))

    # loss: Frobenius norm of the reconstruction error
    l = torch.norm(indian_pines - y_pred, 'fro')

    if epoch % 10 == 0:
        print(f'epoch: {epoch} loss: {l}')

    # gradient
    l.backward()

    # update step, then reset the accumulated gradient
    with torch.no_grad():
        U1 -= learning_rate * U1.grad
        U1.grad.zero_()

with example output (exact values vary because of the random initialization):

epoch: 0 loss: 44439840488.652824
epoch: 10 loss: 27657067086.461464
epoch: 20 loss: 17353003250.14576
epoch: 30 loss: 10980377562.427532
epoch: 40 loss: 7000015690.042022
epoch: 50 loss: 4478747227.40419
epoch: 60 loss: 2847777701.784741
epoch: 70 loss: 1757431994.7743077
epoch: 80 loss: 990962121.4576876
epoch: 90 loss: 426658102.95583844

This loss seems to be very high, and it gets even worse when I increase learning_rate. Of course, decreasing it makes the loss go down at a much slower rate. My question is: is there something wrong with the way I use autograd that results in such a high loss? How could I improve the quality of the fit? Thanks in advance.

cchatzis
  • The loss is steadily decreasing in your example, so that looks fine. High loss means that your model's fit is bad, but that's OK - gradient descent should improve it. How are you initializing `U1`? Parameter initialization can significantly affect training. – ForceBru Feb 15 '22 at 18:15
  • Thank you for your comment. I am initializing `U1` randomly with `U1 = torch.rand(size=(220,150),dtype=torch.float64,requires_grad=True)`. That's probably it? – cchatzis Feb 15 '22 at 18:17
  • 1
    I just ran a simple version of this code with your initialization and some randomly generated (`torch.randn(size=(220, 21025), dtype=torch.float64, requires_grad=False)`) data `X` aka `indian_pines` - the loss starts at about 1'195'633 and decreases to about 30'811 after 100 epochs as expected. I think the magnitude of the loss depends on the shape of `X` here: if the second dimension has a lot of data (21025 in your example), you're going to get a high initial loss. Feed in less data - you'll get a lower loss. IMHO, the main point is that the loss is successfully minimized, so the code works. – ForceBru Feb 15 '22 at 18:37
  • @cchatz High loss is expected. PCA is a linear transformation and simply might not be able to model the data's complexity. You can only get so far (even given more epochs); if you want to reduce the dimensionality of more complex data, you should go for autoencoders, or at least t-SNE (usually used for visualization, though). – Szymon Maszke Feb 15 '22 at 19:22
  • @ForceBru just tried your example and it makes sense. Also, I just noticed the scale of the elements of `indian_pines` played a role. I tried centering the data before gradient descent and it got me better results (see the sketch after these comments). Thanks! – cchatzis Feb 16 '22 at 09:17
  • @SzymonMaszke will definitely look in your recommendations. Thank you! – cchatzis Feb 16 '22 at 09:18
  • I would be more than happy to accept either of your comments as answers. – cchatzis Feb 16 '22 at 09:21
  • @cchatz this question is actually not suitable for this stack; it should be in Data Science or another ML/AI-related stack, as it's not strictly technical. You might want to move it there or delete it from here altogether in order not to introduce more noise. – Szymon Maszke Feb 16 '22 at 15:08
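
The centering mentioned in the comments could look roughly like the sketch below. It reuses the same training loop as the question on per-band mean-centered data; the dummy random data and the interpretation of "centering" as row-wise (per-band) mean subtraction are assumptions:

import torch

# dummy stand-in with the same shape as the vectorized Indian Pines data
indian_pines = torch.rand(220, 21025, dtype=torch.float64)
U1 = torch.rand(size=(220, 150), dtype=torch.float64, requires_grad=True)

# center each spectral band (row) around zero before fitting;
# per-band mean subtraction is one reading of "centering" here
indian_pines = indian_pines - indian_pines.mean(dim=1, keepdim=True)

learning_rate = 2e-9
for epoch in range(100):
    # same loop as in the question, now on the centered data
    y_pred = torch.tensordot(U1 @ torch.t(U1), indian_pines, ([0], [0]))
    l = torch.norm(indian_pines - y_pred, 'fro')
    l.backward()
    with torch.no_grad():
        U1 -= learning_rate * U1.grad
        U1.grad.zero_()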

0 Answers