
I'm trying to teach myself machine learning and I have a similar question to this.

Is this correct:

For example, say I have an input matrix where x1, x2 and x3 are three numerical features (e.g. petal length, stem length and flower length), and I'm trying to label whether each sample is a particular flower species or not:

x1  x2  x3  label
5   1   2   yes
3   9   8   no
1   2   3   yes
9   9   9   no  

That you take the vector formed by the first ROW (not column) of the table above and input it into the network, like this:

i.e. there would be three input neurons (one for each value of the first table row); w1, w2 and w3 are randomly selected; then, to calculate the first neuron in the next column, you do the multiplication I have described (each input times its weight, summed), and you add a randomly selected bias term. This gives the value of that node.

This is done for a set of nodes (each column would actually have four nodes, three plus a bias; for simplicity I removed the other three nodes from the second column in my drawing), and then at the last node before the output there is an activation function that transforms the sum into a value (e.g. between 0 and 1 for a sigmoid), and that value tells you whether the classification is yes or no.
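
To make the arithmetic concrete, here is a minimal NumPy sketch of what I think one such node computes for the first row of the table (the weight and bias values below are just made-up stand-ins for the randomly initialised ones):

import numpy as np

x = np.array([5, 1, 2])          # first row of the table: x1, x2, x3
w = np.array([0.2, -0.5, 0.1])   # hypothetical randomly initialised weights
b = 0.4                          # hypothetical randomly initialised bias

z = w @ x + b                    # weighted sum plus bias = the node's value

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(z))                # squashed into (0, 1) for the yes/no decision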

I'm sorry for how basic this is; I really want to understand the process, and I'm learning from free resources. So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features, e.g. in this case it would make sense to write:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(3,activation='softmax'))

What I don't understand is why the keras model has an activation function in each layer of the network and not just at the end; that's why I'm wondering whether my understanding is correct, and why I added the picture.

Edit 1: Just a note: I saw that on the bias neuron I put 'b=1' on the edge, which might be confusing. I know the bias doesn't have a weight; that was just a reminder to myself that the weight of the bias node is 1.


3 Answers


It seems your question is why there is an activation function for each layer instead of just the last layer. The simple answer is: if there are no non-linear activations in the middle, then no matter how deep your network is, it can be boiled down to a single linear equation. Non-linear activations are therefore one of the key ingredients that let deep networks actually be "deep" and learn high-level features.

Take the following example: say you have a 3-layer neural network without any non-linear activations in the middle, just a final softmax layer. The weights and biases of these layers are (W1, b1), (W2, b2) and (W3, b3). Then you can write the network's final output as follows.

h1 = W1.x + b1
h2 = W2.h1 + b2
h3 = Softmax(W3.h2 + b3)

Let's do some manipulation and simply express h3 as a function of x:

h3 = Softmax(W3.(W2.(W1.x + b1) + b2) + b3)
h3 = Softmax((W3.W2.W1) x + (W3.W2.b1 + W3.b2 + b3))

In other words, h3 has the following form.

h3 = Softmax(W.x + b)

So, without the non-linear activations, our 3-layer network has been squashed into a single-layer network. That is why non-linear activations are important.
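
If you want to verify this collapse numerically, here is a small NumPy sketch (the random weights and the 3 -> 4 -> 4 -> 2 shapes are arbitrary) showing that the layer-by-layer computation and the single collapsed layer give the same pre-softmax output:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# three linear layers with arbitrary shapes 3 -> 4 -> 4 -> 2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)
W3, b3 = rng.normal(size=(2, 4)), rng.normal(size=2)

# layer by layer, with no activations in between
h1 = W1 @ x + b1
h2 = W2 @ h1 + b2
out_deep = W3 @ h2 + b3

# the same computation collapsed into a single W and b
W = W3 @ W2 @ W1
b = W3 @ W2 @ b1 + W3 @ b2 + b3
out_single = W @ x + b

print(np.allclose(out_deep, out_single))  # True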

thushv89

Several issues here apart from the question in your title, but since this is not the time & place for full tutorials, I'll limit the discussion to some of your points, taking also into account that at least one more answer already exists.

So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features,

No.

The number of features is passed in the input_dim argument, which is set only for the first layer of the model; the number of inputs for every layer except the first one is simply the number of outputs of the previous one. The Keras model you have written is not valid, and it will produce an error, since for your 2nd layer you ask for input_dim=3, while the previous one has clearly 6 outputs (nodes).
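
A corrected sketch of your model (just an illustration, not the only reasonable architecture; the 6 is an arbitrary choice, not tied to the 3 features) simply drops the input_dim from the second layer, which then takes the 6 outputs of the first layer as its inputs:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(6, input_dim=3, activation='relu'))  # input_dim only on the first layer
model.add(Dense(6, activation='relu'))               # inputs inferred from the previous layer's 6 outputs
model.add(Dense(3, activation='softmax'))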

Beyond this input_dim argument, there is no other relationship whatsoever between the number of data features and the number of network nodes; and since it seems you have in mind the iris data (4 features), here is a simple reproducible example of applying a Keras model to them.

What is somewhat hidden in the Keras sequential API (which you use here) is that there is in fact an implicit input layer, and the number of its nodes is the dimensionality of the input; see own answer in Keras Sequential model input layer for details.

So, the model you have drawn in your pad actually corresponds to the following Keras model written using the sequential API:

model = Sequential()
model.add(Dense(1,input_dim=3,activation='linear'))

where in the functional API it would be written as:

from keras.models import Model
from keras.layers import Input, Dense

inputs = Input(shape=(3,))
outputs = Dense(1, activation='linear')(inputs)

model = Model(inputs, outputs)

and that's all, i.e. it is actually just linear regression.

I know the bias doesn't have a weight

The bias does have a weight. Again, the useful analogy is with the constant term of linear (or logistic) regression: the bias "input" itself is always 1, and its corresponding coefficient (weight) is learned through the fitting process.
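
One quick way to see this in Keras itself: a Dense layer stores both a kernel (the input weights) and a trainable bias vector, which you can inspect with get_weights() (the layer sizes here are arbitrary):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(1, input_dim=3, activation='linear'))

kernel, bias = model.layers[0].get_weights()
print(kernel.shape)  # (3, 1): one weight per input feature
print(bias.shape)    # (1,): one learnable bias weight for the single node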

why the keras model has an activation function in each layer of the network and not just at the end

I trust this has been covered sufficiently in the other answer.

I'm sorry for how basic this is, I want to really understand the process, and I'm doing it from free resources.

We all did; no excuse though to not benefit from Andrew Ng's free & excellent Machine Learning MOOC at Coursera.

desertnaut

Imagine you have an activation only in the last layer (in your case a sigmoid; it could be something else too, say a softmax). Its purpose is to convert real values into a 0-to-1 range for a classification-style answer. But the activation in the inner (hidden) layers has a different purpose altogether: to introduce nonlinearity. Without it (say ReLU, tanh, etc.), what you get is a linear function, and however many hidden layers you have, you still end up with a linear function. Converting this into a nonlinear function only at the last layer might work for some simple nonlinear problems, but it will not be able to capture a complex nonlinear function. That is why each hidden unit (in each layer) includes an activation function to incorporate nonlinearity.
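
For example (a sketch with arbitrary layer sizes), compare these two models: the first, with linear hidden layers, can never represent more than a single linear map followed by a sigmoid, no matter how many layers you stack; the second, with ReLU in the hidden layers, is a genuinely nonlinear classifier:

from keras.models import Sequential
from keras.layers import Dense

# stacked linear layers: collapses to logistic regression on the raw inputs
linear_model = Sequential()
linear_model.add(Dense(8, input_dim=3, activation='linear'))
linear_model.add(Dense(8, activation='linear'))
linear_model.add(Dense(1, activation='sigmoid'))

# same shape, but with ReLU in the hidden layers: can learn nonlinear decision boundaries
nonlinear_model = Sequential()
nonlinear_model.add(Dense(8, input_dim=3, activation='relu'))
nonlinear_model.add(Dense(8, activation='relu'))
nonlinear_model.add(Dense(1, activation='sigmoid'))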

Bincy