I'm currently writing some pedagogical material, borrowing from common examples that have been reworked many times on the web.
I have a simple bit of code where I manually create weight tensors for the layers and update them inside a training loop, e.g.:
w1 = torch.randn(D_in, H, dtype=torch.float, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.float, requires_grad=True)

learning_rate = 1e-6
for t in range(501):
    # Forward pass: two linear layers with a ReLU in between
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass
    loss.backward()

    # Manual weight update; wrapped in no_grad() so autograd doesn't track the updates
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Reset gradients before the next iteration
        w1.grad.zero_()
        w2.grad.zero_()
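(For completeness: both snippets assume x, y, and the sizes are already defined. The exact sizes don't matter for the question; a minimal sketch of the kind of setup I mean, with placeholder values, would be:)

import torch

# Placeholder sizes: batch, input, hidden, output (illustrative only)
N, D_in, H, D_out = 64, 1000, 100, 10

# Random input and target data, just for demonstration
x = torch.randn(N, D_in, dtype=torch.float)
y = torch.randn(N, D_out, dtype=torch.float)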
This works great. Then I construct similar code using actual modules:
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
for t in range(501):
    # Forward pass and loss
    y_pred = model(x)
    loss = loss_fn(y_pred, y)

    # Zero gradients, then backward pass
    model.zero_grad()
    loss.backward()

    # Manual parameter update via .data, bypassing autograd tracking
    for param in model.parameters():
        param.data -= learning_rate * param.grad
This also works great.
BUT there is a difference here. If I use a 1e-4 learning rate in the manual case, the loss explodes: it becomes large, then inf, then nan. So that's no good. If I use a 1e-6 learning rate in the model case, the loss decreases far too slowly.
Basically I'm just trying to understand why the learning rate means something very different in these two snippets, which are otherwise equivalent.
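To make the comparison concrete, here is a small self-contained sketch (sizes are placeholders, same as above) that prints the magnitude of the initial weights and of the first gradients in each setup, which is the difference I'm trying to understand:

import torch

N, D_in, H, D_out = 64, 1000, 100, 10  # placeholder sizes
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Manual tensors, as in the first snippet
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)
loss = ((x.mm(w1).clamp(min=0).mm(w2) - y) ** 2).sum()
loss.backward()
print("manual: |w1| =", w1.abs().mean().item(),
      " |w1.grad| =", w1.grad.abs().mean().item())

# nn.Sequential model, as in the second snippet
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss = torch.nn.MSELoss(reduction='sum')(model(x), y)
loss.backward()
first = model[0].weight
print("model:  |w| =", first.abs().mean().item(),
      " |grad| =", first.grad.abs().mean().item())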