
I am currently just trying to write some pedagogical material, in which I borrow from some common examples that have been reworked numerous times on the web.

I have a simple bit of code where I manually create tensors for layers, and update them within a loop. E.g.:

# x, y and the sizes D_in, H, D_out are defined earlier in the example
w1 = torch.randn(D_in, H, dtype=torch.float, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.float, requires_grad=True)

learning_rate = 1e-6
for t in range(501):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)   # linear -> ReLU -> linear
    loss = (y_pred - y).pow(2).sum()        # sum-of-squares loss
    loss.backward()
    with torch.no_grad():                   # update weights outside autograd
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

This works great. Then I construct similar code using actual modules:

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
for t in range(501):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    model.zero_grad()                       # clear gradients from the previous step
    loss.backward()
    for param in model.parameters():
        param.data -= learning_rate * param.grad   # .data sidesteps autograd tracking

This also works great.

BUT there is a difference here. If I use a 1e-4 LR in the manual case, the loss explodes: it becomes large, then inf, then NaN. So that's no good. If I use a 1e-6 LR in the model case, the loss decreases far too slowly.

Basically, I'm just trying to understand why the learning rate means something very different in these two snippets, which are otherwise equivalent.

David Mertz

1 Answer


The crucial difference is the initialization of the weights. The weight matrix in an nn.Linear is initialized sensibly, at a scale tied to the layer's fan-in (roughly within ±1/sqrt(in_features)), whereas torch.randn gives unit-variance entries regardless of layer size. I'm pretty sure that if you construct both models and copy the weight matrices from one to the other, you'll get consistent behavior.

Additionally, please note that the two models are not equivalent anyway: your handcrafted model lacks biases, and that matters as well.
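
A minimal sketch of that weight-copying idea, assuming tutorial-style sizes and random data for illustration: transpose the nn.Linear weights (stored as (out_features, in_features)) and copy them into the hand-rolled tensors, so both versions start from the same point, apart from the missing biases.

import torch

N, D_in, H, D_out = 64, 1000, 100, 10   # assumed sizes, for illustration
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# nn.Linear stores its weight as (out_features, in_features), so transpose
# before copying into the hand-rolled tensors used as x.mm(w1).
with torch.no_grad():
    w1 = model[0].weight.t().clone().requires_grad_(True)
    w2 = model[2].weight.t().clone().requires_grad_(True)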

dedObed
  • Thank you, that is helpful. I suspect the initialization is less important in this case than the absence of a bias in the manual case. But either way, this suffices to add to my explanation. I don't entirely need them to be identical, just to be able to explain why they are different. – David Mertz Mar 14 '19 at 19:59
  • Oh, I'd put most of my money on the initialization being the key ;-) Consult the backpropagation formula to see what happens to your gradients when you multiply a weight matrix by a constant... while the objective function remains the same. And from an instructional perspective, I think it'd be super cool to initialize the manual model carefully and observe how it suddenly becomes easier to train (sketched below) ;-) I've shot myself in the foot several times before finally learning to pay attention to initialization. – dedObed Mar 14 '19 at 20:03
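
A sketch of that suggestion, again with sizes and data assumed for illustration: draw the hand-rolled weights at a scale of roughly 1/sqrt(fan_in), comparable to nn.Linear's default, and the manual loop should tolerate the larger 1e-4 learning rate much better than with unit-variance weights.

import torch

# Assumed sizes and data, for illustration only.
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Scale the random init by 1/sqrt(fan_in), roughly the scale nn.Linear uses.
w1 = (torch.randn(D_in, H) / D_in ** 0.5).requires_grad_(True)
w2 = (torch.randn(H, D_out) / H ** 0.5).requires_grad_(True)

learning_rate = 1e-4   # the rate that diverged with unit-variance init
for t in range(501):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    loss.backward()
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()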