
I am trying to compute the derivative of a neural network with 2 or more hidden layers with respect to its inputs. So this is not "standard backpropagation", since I am not interested in how the output varies with respect to the weights, and I am not looking to train my network with it (if this warrants removing the backpropagation tag, let me know, but I suspect that what I need is not too different).

The reason for my interest in the derivative here is that I have a test set which sometimes provides me with a matching [x1, x2] : [y] pair, and sometimes an [x1, x2] : [d(y)/dx1] or [x1, x2] : [d(y)/dx2]. I then use a particle swarm algorithm to train my network.

I like diagrams, so to save a few words here is my network:

[image: my network]

and what I would like is for the compute_derivative method to return a numpy array of the form below:

[image: the desired output array]

This is my attempt so far, but I can't seem to get an array matching my number of inputs at the end. I can't figure out what I am doing wrong.

def compute_derivative(self):
    """Computes the network derivative and returns an array with the change in output with respect to each input"""
    self.compute_layer_derivative(0)
    for l in np.arange(1,self.size):
        dl = self.compute_layer_derivative(l)
        dprev = self.layers[l-1].derivatives
        self.output_derivatives = dl.T.dot(dprev)

    return self.output_derivatives

def compute_layer_derivative(self, l_id):
    wL = self.layers[l_id].w
    zL = self.layers[l_id].output
    daL = self.layers[l_id].f(zL, div=1)
    daLM = np.repeat(daL,wL.shape[0], axis=0)

    self.layers[l_id].derivatives = np.multiply(daLM,wL)

    return self.layers[l_id].derivatives

If you want to run the entire code, I have made a cut-down, commented version which will work with a copy-paste (see below). Thank you for your help!

# -*- coding: utf-8 -*-

import numpy as np

def sigmoid(x, div = 0):
    if div == 1: #first derivative f'
        return np.exp(-x) / (1. + np.exp(-x))**2.
    if div == 2: # second derivative f''
        return - np.exp(x) * (np.exp(x) - 1) / (1. + np.exp(x))**3.
    return 1. / (1. + np.exp(-x)) # f

def linear(x, div = 0):
    if div == 1: #first derivative f'
        return np.full(x.shape,1)
    if div == 2:  # second derivative f''
        return np.zeros(x.shape)
    return x # f

class Layer():
    def __init__(self, in_n, h_n, activation, bias = True, debug = False):
        self.w = 2*np.random.random((in_n, h_n)) - 1 # synaptic weights with 0 mean
        self.f = activation
        self.output = None
        self.activation = None
        self.derivatives = np.array([[None for i in range(in_n+1)]]) #+1 for global dev
        if bias:
            self.b = 2*np.random.random((1, h_n)) - 1
        else:
            self.b = None

        if debug:
            self.w = np.full((in_n, h_n), 1.)
            if self.b is not None: self.b = np.full((1, h_n), 1.)

    def compute(self, inputs):
        if self.w.shape[0] != inputs.shape[1]:
            raise ValueError("Inputs dimensions do not match test data dim.")
        if self.b is None:
            self.output = np.dot(inputs, self.w)
        else:
            self.output = np.dot(inputs, self.w) + self.b

        self.activation = self.f(self.output)

class NeuralNetwork():
    def __init__(self, nb_layers, in_NN, h_density, out_NN, debug = False):
        self.debug = debug
        self.layers = []
        self.size = nb_layers+1  
        self.output_derivatives = None
        self.output = None
        self.in_N = in_NN
        self.out_N = out_NN
        if debug: 
            print("Input Layer with {} inputs.".format(in_NN))

        #create hidden layers
        current_inputs = in_NN
        for l in range(self.size - 1):
            self.layers.append(Layer(current_inputs, h_density, sigmoid, debug = debug))
            current_inputs = h_density
            if debug:
                print("Hidden Layer {} with {} inputs and {} neurons.".format(l+1, self.layers[l].w.shape[0], self.layers[l].w.shape[1]))
        #create output layer
        self.layers.append(Layer(current_inputs, out_NN, linear, bias=False, debug = debug))
        if debug:
            print("Output Layer with {} inputs and {} outputs.".format(self.layers[-1].w.shape[0], self.layers[-1].w.shape[1]))
            #print("with w: {}".format(self.layers[l].w))
            print("ANN size = {}, with {} Layers\n\n".format( self.size, len(self.layers)))

    def compute(self, point):
        curr_inputs = point
        for l in range(self.size):
            self.layers[l].compute(curr_inputs)
            curr_inputs = self.layers[l].activation
        self.output = curr_inputs
        if self.debug: print("ANN output: ",curr_inputs)
        return self.output

    def compute_derivative(self, order, point):
        """ If the network has not been computed, compute it before getting
            the derivative. This might be a bit expensive..."""
        if self.layers[self.size-1].output is None:
            self.compute(point)

        #Compute output layer total derivative
        self.compute_layer_derivative(self.size-1, order)
        self.output_derivatives = self.get_partial_derivatives_to_outputs(self.size-1)
        print(self.output_derivatives)

        for l in np.arange(1,self.size):
            l = self.size-1 - l
            self.compute_layer_derivative(l, order)
            if l > 0: #if we are not at first hidden layer compute the total derivative
                self.output_derivatives *= self.get_total_derivative_to_inputs(l)
            else:# get the each output derivative with respect to each input
                backprop_dev_to_outs = np.repeat(np.matrix(self.output_derivatives),self.in_N, axis=0).T
                dev_to_inputs = np.repeat(np.matrix(self.get_partial_derivatives_to_inputs(l)).T,self.out_N, axis=1).T
                self.output_derivatives = np.multiply(backprop_dev_to_outs, dev_to_inputs)

            if self.debug: print("output derivatives: ",self.output_derivatives)
        return self.output_derivatives

    def get_total_derivative(self,l_id):
        return np.sum(self.get_partial_derivatives_to_inputs(l_id))

    def get_total_derivative_to_inputs(self,l_id):
        return np.sum(self.get_partial_derivatives_to_inputs(l_id))

    def get_partial_derivatives_to_inputs(self,l_id):
        return np.sum(self.layers[l_id].derivatives, axis=1)    

    def get_partial_derivatives_to_outputs(self,l_id):
        return np.sum(self.layers[l_id].derivatives, axis=0)

    def compute_layer_derivative(self, l_id, order):  
        if self.debug: print("\n\ncurrent layer is ", l_id)
        wL = self.layers[l_id].w
        zL = self.layers[l_id].output
        daL = self.layers[l_id].f(zL, order)
        daLM = np.repeat(daL,wL.shape[0], axis=0)

        self.layers[l_id].derivatives = np.multiply(daLM,wL)

        if self.debug:
            print("L_id: {}, a_f: {}".format(l_id, self.layers[l_id].f))
            print("L_id: {}, dev: {}".format(l_id, self.get_total_derivative_to_inputs(l_id)))

        return self.layers[l_id].derivatives

#nb_layers, in_NN, h_density, out_NN, debug = False
nn = NeuralNetwork(1,2,2,1, debug= True)
nn.compute(np.array([[1,1]]))# head value
nn.compute_derivative(1,np.array([[1,1]])) #first derivative

EDIT: NEW ATTEMPT BASED ON SIRGUY'S REPLY:

# Here we assume that the layer has sigmoid activation
# (sigmoid_d is the first derivative of the sigmoid, i.e. sigmoid(x, div=1) defined above)
def Jacobian(x = np.array([[1,1]]), w = np.array([[1,1],[1,1]]), b = np.array([[1,1]])):
    return sigmoid_d(x.dot(w) + b) * w # J(S, x)

In the case of a network with 2 hidden layers with sigmoid activation and one output layer with sigmoid activation (so that we can just use the same function as above) we have:

J_L1 =  Jacobian(x = np.array([[1,1]])) # where [1,1] are the inputs to the network (i.e. values of the neurons in the input layer)
J_L2 =  Jacobian(x = np.array([[3,3]])) # where [3,3] are the neuron values of layer 1 before activation
# in the output layer the weights and biases are adjusted as there is 1 neuron rather than 2
J_Lout = Jacobian(x = np.array([[2.90514825, 2.90514825]]), w = np.array([[1],[1]]), b = np.array([[1]]))# where [2.905,2.905] are the neuron values of layer 2 before activation
J_out_to_in = J_Lout.T.dot(J_L2).dot(J_L1)
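
To check this numerically, here is a small finite-difference sketch (my own check, not from SirGuy's reply; the forward helper below just hard-codes the same all-ones weights and a sigmoid output layer as assumed above), whose output can be compared with J_out_to_in:

import numpy as np

def forward(x, w_h=np.ones((2, 2)), b_h=np.ones((1, 2)), w_o=np.ones((2, 1)), b_o=np.ones((1, 1))):
    sig = lambda z: 1. / (1. + np.exp(-z))
    a1 = sig(x.dot(w_h) + b_h)      # hidden layer 1
    a2 = sig(a1.dot(w_h) + b_h)     # hidden layer 2 (same debug weights)
    return sig(a2.dot(w_o) + b_o)   # sigmoid output layer, as assumed above

x0, eps = np.array([[1., 1.]]), 1e-6
fd = [((forward(x0 + dx) - forward(x0 - dx)) / (2 * eps)).item()
      for dx in (np.array([[eps, 0.]]), np.array([[0., eps]]))]
print(fd)  # numerical [d(out)/dx1, d(out)/dx2] via central differences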
Sorade
  • I answered [this question](https://stackoverflow.com/a/37791611/1277769) for how the derivative works. You want to do that in combination with the chain rule to get the final expression. Is that hint enough? – SirGuy Sep 05 '18 at 15:28
  • Thank you @SirGuy. So in my case the `p` function would be the activation function (sigmoid or linear) right ? The indices `i` and `j` are the neurons of the current layer and previous respectively ? I will give it a go and get back to you ! – Sorade Sep 05 '18 at 16:09
  • Hi @SirGuy I have managed to get it to work ... almost! I think I am doing something wrong with my output layer which has no sigmoid activation function, because I end up with a derivative which is always multiplied by the weight of the final layer. Say I have 2 neurons between my L_output and L_-1 with a weight of 1, then my derivative is too large by a factor of 2. – Sorade Sep 06 '18 at 15:37
  • I have just edited my question with my newest attempt – Sorade Sep 06 '18 at 16:04
  • In your code you compute the `'total derivative'` as the sum of the partial derivatives. This is incorrect. The total derivative *is* the Jacobian matrix. – SirGuy Sep 06 '18 at 17:36
  • Yes! I have just realised that. I am unclear what I should do with the Jacobian Matrix (which is the result of `np.multiply(daLM,wL)`, right?). I can imagine that if the Jacobian from each layer is "stacked" on top of the one from the previous layer some terms will start to form part of a chain rule. I am unclear about what happens if the matrix changes size between layers though. (Sorry, my linear algebra is not good, so as soon as things start to become vectors and matrices I really struggle) – Sorade Sep 06 '18 at 18:04

1 Answer


Here's how I derived what your example should give:

# i'th component of vector-valued function S(x) (sigmoid-weighted layer)
S_i(x) = 1 / (1 + exp(-(w_i . x + b_i))) # . for matrix multiplication here

# i'th component of vector-valued function L(x) (linear-weighted layer)
L_i(x) = w_i . x # different weights than S.
# as it happens our L(x) outputs 1 value, so it is in fact a scalar function

F(x) = L(S(x)) # final output value

#derivative of F, denoted as J(F, x) to mean the Jacobian of the function F, evaluated at x.
J(F, x) = J(L(S(x)), x) = J(L, S(x)) . J(S, x) # chain rule for multivariable, vector-valued functions

#First, what's the derivative of L?
J(L, S(x)) = L # i.e. the weight matrix that defines L

This is usually a surprising result, but you can verify this yourself by computing partial derivatives of M . x for some random matrix M. If you compute all the derivatives and put them into the Jacobian you will get back M.
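
For a quick numerical illustration (a minimal sketch with an arbitrary random M), build the Jacobian of x -> M.x column by column with central finite differences and compare it with M:

import numpy as np

M = np.random.random((3, 2))       # an arbitrary linear map from R^2 to R^3
x = np.random.random(2)
eps = 1e-6

# build the Jacobian of f(x) = M.dot(x) column by column with central differences
J = np.empty((3, 2))
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    J[:, j] = (M.dot(x + e) - M.dot(x - e)) / (2 * eps)

print(np.allclose(J, M))           # True: the Jacobian of x -> M.x is M itself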

#Now what's the derivative of S? Compute via formula
d(S_i(x))/dx_j = w_ij * exp(-(w_i . x + b_i)) / (1 + exp(-(w_i . x + b_i)))**2 # w_ij is the j'th component of the vector w_i
#For the gradient of a S_i (which is just one component of S), we get
J(S_i, x) = (exp(-(w_i . x + b_i)) / (1 + exp(-(w_i . x + b_i)))**2) * w_i # remember this is a vector because w_i is a vector

Now to take your debug example of 1's everywhere.

w_i = b = x = [1, 1]

#define a to make this less cluttered
a = exp(-(w_i . x + b_i)) = exp(-3)

J(S_i, x) = a / (1 + a)**2 * [1, 1]
J(S, x) = a / (1 + a)**2 * [[1, 1], [1, 1]]
J(L, S(x)) = [1, 1] #Doesn't depend on S(x)

J(F, x) = J(L, S(x)) . J(S, x) = (a / (1 + a)**2) * [1, 1] . [[1, 1], [1, 1]]
J(F, x) = (a / (1 + a)**2) * [2, 2] = (2 * a / (1 + a)**2) * [1, 1]
J(F, x) = [0.0903533, 0.0903533]

Hopefully this will help you reorganise your code a bit. You can't evaluate the derivatives here with just the value of w_i . x; you will need w_i and x separately to properly compute everything.
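
For reference, the same hand calculation can be reproduced in numpy (a minimal sketch using the all-ones debug values):

import numpy as np

w = np.ones((2, 2))                   # hidden sigmoid layer weights (all ones)
b = np.ones(2)                        # hidden biases
w_L = np.ones(2)                      # linear output layer weights (the L layer above)

x = np.ones(2)
z = w.dot(x) + b                      # pre-activations, [3, 3]
a = np.exp(-z)                        # the 'a' from above, exp(-3) for each neuron
J_S = (a / (1 + a)**2)[:, None] * w   # J(S, x): row i is S_i'(z_i) * w_i
J_F = w_L.dot(J_S)                    # J(F, x) = J(L, S(x)) . J(S, x), with J(L, S(x)) = w_L
print(J_F)                            # ~[0.0903533, 0.0903533], matching the value above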

EDIT

Because I find this stuff interesting, here is my python script for computing the value and first derivative of a neural network:

import numpy as np

class Layer:
    def __init__(self, weights_matrix, bias_vector, sigmoid_activation = True):
        self.weights_matrix = weights_matrix
        self.bias_vector = bias_vector
        self.sigmoid_activation = sigmoid_activation

    def compute_value(self, x_vector):
        result = np.add(np.dot(self.weights_matrix, x_vector), self.bias_vector)
        if self.sigmoid_activation:
            result = np.exp(-result)
            result = 1 / (1 + result)

        return result

    def compute_value_and_derivative(self, x_vector):
        if not self.sigmoid_activation:
            return (self.compute_value(x_vector), self.weights_matrix)
        temp = np.add(np.dot(self.weights_matrix, x_vector), self.bias_vector)
        temp = np.exp(-temp)
        value = 1.0 / (1 + temp)
        temp = temp / (1 + temp)**2
        #pre-multiplying by a diagonal matrix multiplies each row by
        #the corresponding diagonal element
        #(1st row with 1st value, 2nd row with 2nd value, etc...)
        jacobian = np.dot(np.diag(temp), self.weights_matrix)
        return (value, jacobian)

class Network:
    def __init__(self, layers):
        self.layers = layers

    def compute_value(self, x_vector):
        for l in self.layers:
            x_vector = l.compute_value(x_vector)

        return x_vector

    def compute_value_and_derivative(self, x_vector):
        x_vector, jacobian = self.layers[0].compute_value_and_derivative(x_vector)
        for l in self.layers[1:]:
            x_vector, j = l.compute_value_and_derivative(x_vector)
            jacobian = np.dot(j, jacobian)

        return x_vector, jacobian

#first weights
l1w = np.array([[1,1],[1,1]])
l1b = np.array([1,1])

l2w = np.array([[1,1],[1,1]])
l2b = np.array([1,1])

l3w = np.array([1, 1])
l3b = np.array([0])

nn = Network([Layer(l1w, l1b),
              Layer(l2w, l2b),
              Layer(l3w, l3b, False)])

r = nn.compute_value_and_derivative(np.array([1,1]))
print(r)
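
As a sanity check (a small verification sketch that can be appended to the script above), the analytic Jacobian can be compared against central finite differences:

eps = 1e-6
x = np.array([1.0, 1.0])
value, jac = nn.compute_value_and_derivative(x)

# numerical Jacobian, one input at a time
cols = []
for j in range(x.size):
    e = np.zeros_like(x)
    e[j] = eps
    cols.append((nn.compute_value(x + e) - nn.compute_value(x - e)) / (2 * eps))

print(np.allclose(jac, np.column_stack(cols)))  # True if the analytic Jacobian is right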
SirGuy
  • Thank you. I am still struggling to see how that would be extended for two hidden layers with sigmoid activation. My understanding is that it should be : `J(L, S(S(x))) . J(S, S(x)) . J(S, x)` but I can't get my head around how that can be computed. Should it be `J(S, x)` multiplied by the weights between the two layers ? In which case just extend `J(F, x)` to `J(L, S(S(x))) . J(S, S(x)) . J(S, x) = (a / (1 + a)**2) * [1, 1] . [[1, 1], [1, 1]] . [[1, 1], [1, 1]]` ?? – Sorade Sep 10 '18 at 10:50
  • @Sorade focus on what each jacobian matrix is supposed to be individually, you can also just introduce new variables to help, `x_0` is the input layer and `x_(i+1) = S_i(x_i)`, then the Jacobian would be `J(L, x_n) . J(S_(n-1), x_(n-1)) ... J(S_0, x_0)`. Look at each Jacobian of `S_i` individually as a function of its own variables `x_i`. Doing this you'll see that `J(S_i, x_i) = a_i / (1 + a_i)**2 * W_i`, where `W_i` represents the matrix of weights (which become [[1,1],[1,1]] for the debug example). Hope that helps! – SirGuy Sep 10 '18 at 12:54
  • I think I'm not too far off, I've got some of the right values appearing, but I think I'm getting mixed up with dot products and multiplications. Just out of interest, why is `a` a constant here: `J(S, x) = a / (1 + a)**2 * [[1, 1], [1, 1]]`? Shouldn't it vary at each neuron in the hidden layer? – Sorade Sep 10 '18 at 14:45
  • oh, it simplified to that because each neuron had the same weights. If you work it out with different weights then probably nothing will be factorable out of the matrix. – SirGuy Sep 11 '18 at 01:25
  • I have edited my question with a python test version at the end. It seems to work. I was also wondering, for the case of taking the second derivative of the network, since the second derivative of the linear function is zero, does that imply that the second derivative of the network output with respect to its inputs is always zero ?! – Sorade Sep 11 '18 at 10:13
  • @Sorade no, the first derivative is constant, but it is multiplied by non-constant functions. Look at the example `d**2(exp(2x))/dx**2 = d(2exp(2x)) / dx = 4exp(2x)`. As you can see, even though the chain rule added a constant term the second derivative is not 0. – SirGuy Sep 11 '18 at 13:50
  • So in my Edited question, at the very bottom, when I do: `J_L2 = Jacobian(x = np.array([[3,3]])) # where [3,3] are the neuron values of layer 1 before activation` I should actually replace `[3,3]` by `[S(3),S(3)]` ? – Sorade Sep 11 '18 at 14:06
  • @Sorade I think so? Honestly I find your code really hard to follow. Denoting the Jacobian matrix of a function S evaluated at x by J(S, x), then the chain rule says the following: `J(S2(S1), x) = J(S2, S1(x)) . J(S1, x)`. In this case `S1(x)` is the input to the second layer, and so you plug in those inputs to the Jacobian of S2. – SirGuy Sep 11 '18 at 14:55
  • Thank you. Would it be possible for you to edit your answer to include a working piece of python code for the network described in the question image ? For the moment I find it very hard to verify my code as the only output I have to go with is that of a neural network with one hidden layer. – Sorade Sep 11 '18 at 15:01
  • @Sorade how do you expect to represent the 2nd derivatives of the neural network? There is the Hessian Matrix, but I think that computing that in a generic neural network will require tensors. – SirGuy Sep 11 '18 at 17:17
  • thank you for updating. The paper I base my work on (Aarts, L.P. & van der Veer, P. Neural Processing Letters (2001) 14: 261.) expresses the 2nd derivative of a network with 1 input, 1 hidden sigmoid layer and one linear output layer as: `d^2y/dx^2 = Sum(a_i*w_i^2*S''(w_i*x+b_i))` for i being the index of the neurons in the hidden layer. So in my problem (with 2 inputs, x1, x2) I would only be interested in `d^2y/dx1^2` and `d^2y/dx2^2`. From that paper I suspect that the 2nd derivatives at a sigmoid layer are `w_i^2*S''(w_i*x+b_i)`. Can the same chain rule principle be applied here? – Sorade Sep 12 '18 at 10:27
  • To compute the second derivative in a generalised way, you're going to end up with tensor calculus, and I can't find any reference that just talks about a use case like yours. If you go that route you're going to have to read up on the subject in general. You might be tempted to brute force the algebra there, but I very seriously doubt that that will be easier. I've gone as far into this as I'm willing to go, and wish you good luck with the rest. – SirGuy Sep 12 '18 at 14:08
  • Thanks for the pointers. I think that this second derivative question is out of the topic for this question anyway. Cheers ! – Sorade Sep 12 '18 at 14:10