
Given a neural network with weights theta and inputs x, I am interested in calculating the partial derivatives of the network's output w.r.t. x, so that I can use the result when training the weights theta with a loss that depends both on the output and on these partial derivatives. I figured out how to calculate the partial derivatives following this post. I also found this post that explains how to use sympy to achieve something similar; however, adapting it to a neural network context within PyTorch seems like a huge amount of work and a recipe for very slow code.

Thus, I tried something different, which failed. As a minimal example, I created a function (standing in for my neural network):

import torch
from torch.optim import Adam

theta = torch.ones([3], requires_grad=True, dtype=torch.float32)
def trainable_function(time):
    return theta[0]*time**3 + theta[1]*time**2 + theta[2]*time

Then, I defined a second function to give me partial derivatives:

def trainable_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = trainable_function(deriv_time)
    # torch.autograd.grad returns a one-element tuple
    gradient = torch.autograd.grad(fun_value, deriv_time, create_graph=True, retain_graph=True)
    deriv_time.requires_grad = False
    return gradient

Given some noisy observations of the derivatives, I now try to train theta. For simplicity, I create a loss that only depends on the derivatives. In this minimal example, the derivatives are used directly as observations, not as a regularization term, to avoid complicated loss functions that are beside the point.

def objective(train_times, observations):
    predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times]))
    return torch.sum((predictions - observations)**2)

optimizer = Adam([theta], lr=0.1)
for iteration in range(200):
    optimizer.zero_grad()
    loss = objective(data_times, noisy_targets)
    loss.backward()
    optimizer.step()

Unfortunately, when running this code, I get the error

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I suppose that the way I calculate the partial derivatives does not really create a computational graph that autodiff could differentiate through. The connection to the parameters theta somehow gets lost, so to the optimizer it looks as if the loss were completely independent of theta. However, I could be totally wrong.
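To illustrate my suspicion (a minimal sketch, not part of my training code): torch.tensor on a list of graph-connected tensors copies their values and drops the graph, whereas torch.stack keeps it intact:

parts = [theta[0] * t for t in (0.1, 0.2, 0.3)]  # each element is connected to theta

detached = torch.tensor(parts)   # copies the values and detaches them (may warn)
print(detached.requires_grad)    # False -> backward() cannot reach theta

stacked = torch.stack(parts)     # preserves the autograd graph
print(stacked.requires_grad)     # True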

Does anyone know how to fix this? Is it possible to include this type of derivative in the loss function in PyTorch? And if so, what would be the most PyTorch-style way of doing this?

Many thanks for your help and advice, it is much appreciated.

For completeness:

To run the above code, some training data needs to be generated. I used the following code, which works perfectly and has been tested against the analytical derivatives:

true_a = 1
true_b = 1
true_c = 1


def true_function(time):
    return true_a*time**3 + true_b*time**2 + true_c*time


def true_derivative(time):
    deriv_time = torch.tensor(time, requires_grad=True)
    fun_value = true_function(deriv_time)
    return torch.autograd.grad(fun_value, deriv_time)

data_times = torch.linspace(0, 1, 500)
true_targets = torch.squeeze(torch.tensor([true_derivative(a) for a in data_times]))
noisy_targets = true_targets.clone() + torch.randn_like(true_targets)*0.1
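For reference, the analytical derivative of true_function is 3*true_a*time**2 + 2*true_b*time + true_c, so the test against it looks roughly like this (a minimal sketch of the check mentioned above):

analytical_targets = 3*true_a*data_times**2 + 2*true_b*data_times + true_c
assert torch.allclose(true_targets, analytical_targets)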
1 Answer


Your approach to the problem appears overly complicated; what you're trying to achieve is well within reach in PyTorch. I include here a simple code snippet that I believe showcases what you would like to do:

import torch
import torch.nn as nn

# Data and function
torch.manual_seed(0)
input_dim = 1
output_dim = 2
n = 10  # batch size
simple_function = nn.Sequential(nn.Linear(input_dim, output_dim), nn.Sigmoid())
t = (torch.arange(n).float() / n).view(n, input_dim)
x = torch.randn(n, output_dim)
t.requires_grad = True

# Actual computation
xhat = simple_function(t)
# jac has shape (n, output_dim, n, input_dim); sample i's output only depends
# on its own input t[i], so we keep the block-diagonal entries
jac = torch.autograd.functional.jacobian(simple_function, t, create_graph=True)
grad = jac[torch.arange(n), :, torch.arange(n), 0]
loss = (x - xhat).pow(2).sum() + grad.pow(2).sum()
loss.backward()
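Since each sample's output depends only on its own input row, the same per-sample derivatives can also be obtained without materializing the full (n, output_dim, n, input_dim) Jacobian, with one grad call per output column. A sketch of this equivalent alternative (my addition, not part of the answer):

xhat2 = simple_function(t)  # fresh forward pass; backward() above freed the first graph
cols = [torch.autograd.grad(xhat2[:, k].sum(), t, create_graph=True)[0]
        for k in range(output_dim)]
grad_alt = torch.cat(cols, dim=1)      # shape (n, output_dim)
print(torch.allclose(grad, grad_alt))  # True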
mbpaulus
    Thanks for the feedback. Interestingly enough, the problem lies in the line predictions = torch.squeeze(torch.tensor([trainable_derivative(a) for a in train_times])); PyTorch seems to have a problem with the list construction. If predictions is created using torch.zeros and then filled element-wise using a for loop, all works fine. You are right though, the approach is overly complicated; it will need to look like this in the bigger structure of our code framework, though. – theo Feb 24 '21 at 21:14
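For readers hitting the same error: a minimal sketch of the fix hinted at in that comment (my formulation), using torch.stack so the list construction no longer severs the graph:

def objective(train_times, observations):
    # trainable_derivative returns a one-element tuple, hence the [0];
    # torch.stack preserves the autograd graph, unlike torch.tensor(list)
    predictions = torch.stack([trainable_derivative(a)[0] for a in train_times])
    return torch.sum((predictions - observations)**2)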