
I've seen answers to this question, but I still don't understand it at all. As far as I know, this is the most basic setup:

net = CustomClassInheritingFromModuleWithDefinedInitAndForward()
criterion = nn.SomeLossClass()
optimizer = optim.SomeOptimizer(net.parameters(), ...)
for _, data in enumerate(trainloader, 0):
    inputs, labels = data
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

What I don't understand is:

Optimizer is initialized with net.parameters(), which I thought are internal weights of the net.

Loss does not access these parameters nor the net itself. It only has access to net's outputs and input labels.

Optimizer does not access loss either.

So if loss only works on outputs and optimizer only on net.parameters, how can they be connected?

momo

1 Answer


Optimizer is initialized with net.parameters(), which I thought are internal weights of the net.

This is because the optimizer will modify the parameters of your net during training.
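One way to see what "initialized with net.parameters()" means in practice (a small sketch; nn.Linear and SGD stand in for the custom net and optimizer in the question):

```python
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(3, 1)                      # stand-in for the custom net
optimizer = optim.SGD(net.parameters(), lr=0.01)

# The optimizer stores references to the very same tensors the net uses,
# not copies of them:
param_in_opt = optimizer.param_groups[0]['params'][0]
print(param_in_opt is net.weight)          # True: same object
```

So when the optimizer updates its stored parameters, it is updating the net's weights directly.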

Loss does not access these parameters nor the net itself. It only has access to net's outputs and input labels.

The loss only computes an error between a prediction and the truth.

Optimizer does not access loss either.

It accesses the gradients that loss.backward() computes and stores in each parameter's .grad attribute.
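The whole chain can be seen in a minimal runnable sketch (nn.Linear, MSELoss and SGD are illustrative stand-ins for the classes in the question):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Illustrative stand-ins for the classes in the question.
net = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.1)

inputs = torch.randn(4, 3)
labels = torch.randn(4, 1)

optimizer.zero_grad()
outputs = net(inputs)                 # outputs records how it was computed
loss = criterion(outputs, labels)

assert net.weight.grad is None        # nothing stored on the parameters yet
loss.backward()                       # follows outputs' history back to the parameters
assert net.weight.grad is not None    # gradients now live on the parameters

before = net.weight.detach().clone()
optimizer.step()                      # reads each parameter's .grad, updates it in place
assert not torch.equal(before, net.weight)
```

The loss and the optimizer never talk to each other directly: backward() deposits gradients on the parameter tensors, and step() picks them up from there.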

Thomas Schillaci
  • The way I understand it, the tensors loss has access to are outputs and labels, and the tensors optimizer has access to are net.parameters. What are the "tensors that were computed during loss.backward" that both optimizer and loss have access to? – momo Feb 19 '20 at 09:19
  • The loss is only the error between your prediction and your ground truth (that can be labels, yes). When you are doing loss.backward, you are doing a back-propagation, which is a computation of how every parameter of your model is responsible for the error. The optimizer uses this backpropagation to update the parameters of your network. I would recommend you to follow PyTorch's tutorial https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html which explains it very well. – Thomas Schillaci Feb 19 '20 at 09:24
  • I did try to follow this tutorial. I'm afraid I did not understand that part. How does loss access my model? I do not give `net` or `net.parameters` as an argument at any point, the way I do for the optimizer. So how can `loss.backward()` calculate how every parameter of my model is responsible for the error, if it doesn't know what my model is? – momo Feb 19 '20 at 09:30
  • Your loss doesn't access your model, it accesses a *history* of values provided within the `outputs` tensor. – Thomas Schillaci Feb 19 '20 at 09:33
  • Ok, but then how does optimizer access the result of `backward()`? In the tutorial you linked, it says that "...you can call .backward() and have all the gradients computed automatically. The gradient for this tensor will be accumulated into .grad attribute.". So the gradient is accumulated into .grad attribute of the `outputs` tensor. But optimizer does not have access to `outputs`, so how can it apply this gradient to `net.parameters`? – momo Feb 19 '20 at 09:38
  • Alright, I have not been clear enough. The gradient is actually accumulated in the `.grad` attribute of every parameter of your net, which are accessed because your `outputs` has a "history" of how it was computed. – Thomas Schillaci Feb 19 '20 at 09:42
  • So loss.backward() accesses net.parameters through outputs' history and stores the result in net.parameters' .grad attribute - and then optimizer applies the results stored in .grad to net.parameters themselves? For example, I could chain a couple of networks together (so that one's output is the next one's input) and a single loss.backward() would then calculate the necessary gradients for every network, but I would still need a separate optimizer for each one to actually apply these results? – momo Feb 19 '20 at 09:52
  • At least, this is how I understood how pytorch's autograd works, yes. For your chaining example, this is a very specific question that would probably need a separate post, but I think it would be possible with some trickery. – Thomas Schillaci Feb 19 '20 at 09:55
  • I see. Thank you. I did not know that you can access other tensors through computation history. The chaining example wasn't supposed to be a specific question, rather just me trying to check if I correctly understand how it all works. But perhaps I will ask it regardless. – momo Feb 19 '20 at 10:02
  • I'm not sure you can **explicitly** access the history, this may be hidden behind autograd's api. – Thomas Schillaci Feb 19 '20 at 10:03
  • 1
    @momo Every time an operation is performed on a tensor (e.g. add, convolution, etc...) information about the operation is stored in the resulting tensor instance, including a reference to the tensors that the result is derived from. In this way a result can be traced back to the originating parameters and inputs to a model. When you run .backward on the final loss tensor the connections are followed back to the originating (leaf) tensors, propagating gradients and ultimately storing the results into the .grad members. The optimizer applies updates to the parameters using the resulting .grad. – jodag Feb 19 '20 at 12:51