192

I've been reading up on neural networks and I understand the general principle of a single-layer neural network. I understand the need for additional layers, but why are nonlinear activation functions used?

This question is followed by this one: What is the derivative of the activation function used for in backpropagation?

doug
corazza

13 Answers

208

The purpose of the activation function is to introduce non-linearity into the network.

In turn, this allows you to model a response variable (a.k.a. target variable, class label, or score) that varies non-linearly with its explanatory variables.

Non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as an output that plots as a straight line; the word for that is affine).

Another way to think of it: without a non-linear activation function, an NN, no matter how many layers it had, would behave just like a single-layer perceptron, because composing these layers would give you just another linear function (see the definition just above).

>>> import numpy as NP
>>> NP.set_printoptions(precision=2)   # round the printed output for readability

>>> in_vec = NP.random.rand(10)
>>> in_vec
  array([ 0.94,  0.61,  0.65,  0.  ,  0.77,  0.99,  0.35,  0.81,  0.46,  0.59])

>>> # common activation function, hyperbolic tangent
>>> out_vec = NP.tanh(in_vec)
>>> out_vec
 array([ 0.74,  0.54,  0.57,  0.  ,  0.65,  0.76,  0.34,  0.67,  0.43,  0.53])

A common activation function used in backprop (hyperbolic tangent) evaluated from -2 to 2:

[image: plot of tanh(x) over the interval -2 to 2]

doug
  • 21
    Why would we want to eliminate linearity? – corazza Mar 20 '12 at 10:02
  • 22
    If the data we wish to model is non-linear then we need to account for that in our model. – doug Mar 20 '12 at 10:10
  • 54
    One-sentence answer. Nice! – Autonomous May 23 '15 at 00:57
  • 14
    This is a little misleading - as eski mentioned, rectified linear activation functions are extremely successful, and if our goal is just to model/approximate functions, eliminating non-linearity at all steps isn't necessarily the right answer. With enough linear pieces, you can approximate almost any non-linear function to a high degree of accuracy. I found this a good explanation of why rectified linear units work: http://stats.stackexchange.com/questions/141960/deep-neural-nets-relus-removing-non-linearity – tegan Aug 03 '15 at 15:20
  • Although affine activation functions wouldn't have been very exciting either (even though they are non-linear), because that means the output is also just an affine transformation of the input (i.e. a linear transformation plus a constant) – HelloGoodbye Jan 15 '16 at 16:52
  • 4
    @doug: Is your answer equivalent to saying that the activation function allows the neural network to produce a non-linear decision boundary? – stackoverflowuser2010 Aug 04 '16 at 05:16
  • 4
    @stackoverflowuser2010: yes. – doug Aug 04 '16 at 23:43
  • 23
    @tegan **Rectified** linear activation functions are non-linear. I'm not sure what your comment has to do with the answer. – endolith Aug 06 '18 at 16:11
  • 1
    ReLU is piecewise linear, which makes both points. – vwvan Oct 10 '22 at 21:19
  • @corazza and others who are learning this, the point is how many things can you model with a linear relationship? Even something as simple as tossing a ball up in the air and measuring its y position over time is (approximately, forget stuff like General Relativity) quadratic, which is more complicated than linear. So if you're trying to model the relationship between something much more complicated, like between income and reported happiness, you'd better bet your model needs to deal with weird wiggles in the data – Nathan majicvr.com Jan 21 '23 at 15:32
71

A linear activation function can be used, but only on very limited occasions. In fact, to understand activation functions better it is important to look at ordinary least squares, or simply linear regression. Linear regression aims to find the optimal weights that, when combined with the input, result in the minimal vertical error between the explanatory and target variables. In short, if the expected output reflects the linear regression shown in the top figure below, then a linear activation function can be used. But, as in the middle figure below, a linear function will not produce the desired results, whereas a non-linear function, as in the bottom figure, would:

[image: three panels comparing a linear fit that matches the data (top), a linear fit that misses non-linear data (middle), and a non-linear fit that captures it (bottom)]

Activation functions cannot be linear because neural networks with a linear activation function are effective only one layer deep, regardless of how complex their architecture is. The input to a network is usually a linear transformation (input * weight), but the real world and its problems are non-linear. To make the incoming data non-linear, we use a non-linear mapping called an activation function. An activation function is a decision-making function that determines the presence of a particular neural feature. It is mapped between 0 and 1, where zero means absence of the feature and one means its presence. Unfortunately, with a hard 0/1 output, small changes in the weights cannot be reflected in the activation values. Therefore, the non-linear function should be continuous and differentiable over this range. A neural network must be able to take any input from -infinity to +infinity, but it should be able to map it to an output that ranges between (0, 1) or, in some cases, (-1, 1); hence the need for an activation function. Non-linearity is needed in activation functions because the aim in a neural network is to produce a non-linear decision boundary via non-linear combinations of the weights and inputs.
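To make the figure's contrast concrete, here is a minimal sketch, assuming NumPy and its polyfit helper, that fits a straight line by least squares to a linear target and to a quadratic one; only the quadratic target leaves a large residual:

import numpy as np

x = np.linspace(-3, 3, 50)
y_linear = 2 * x + 1         # a relationship a linear model can capture
y_quadratic = x ** 2         # a relationship a linear model cannot capture

for name, y in [("linear target", y_linear), ("quadratic target", y_quadratic)]:
    w, b = np.polyfit(x, y, deg=1)                 # least-squares line y ~ w*x + b
    mse = np.mean((y - (w * x + b)) ** 2)
    print(f"{name}: mean squared error of the best linear fit = {mse:.3f}")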

imbr
chibole
  • 1
    +One. Then can it be deduced that a nonlinear function is used to establish a perfect boundary? – Learner Apr 16 '16 at 17:07
  • 1
    Yes, exactly. Instead of just producing 0 or 1, it can produce 0.4 or 0.78, making it continuous over the range of the boundary. – chibole Apr 18 '16 at 07:18
  • 3
    A neural network must be able to take any input from -infinity to +infinite, but it should be able to map it to an output that ranges between {0,1} or between {-1,1}...it reminds me that ReLU limitation is that it should only be used within Hidden layers of a Neural Network Model. – Cloud Cho Feb 16 '18 at 06:10
26

If we only allow linear activation functions in a neural network, the output will just be a linear transformation of the input, which is not enough to form a universal function approximator. Such a network can just be represented as a matrix multiplication, and you would not be able to obtain very interesting behaviors from such a network.

The same thing goes for the case where all neurons have affine activation functions (i.e. an activation function of the form f(x) = a*x + c, where a and c are constants, which is a generalization of linear activation functions), which will just result in an affine transformation from input to output, which is not very exciting either.

A neural network may very well contain neurons with linear activation functions, such as in the output layer, but these require the company of neurons with a non-linear activation function in other parts of the network.

Note: An interesting exception is DeepMind's synthetic gradients, for which they use a small neural network to predict the gradient in the backpropagation pass given the activation values, and they find that they can get away with using a neural network with no hidden layers and with only linear activations.

HelloGoodbye
  • 1
    Higher order functions can be approximated with linear activation functions using multiple hidden layers. The universal approximation theorem is specific to MLPs with only one hidden layer. – eski Jan 15 '16 at 18:01
  • Actually, I believe you are correct in your statement about affine activation functions resulting in an affine transformation, but the fact that the transformation is learned through backpropagation (or any other means) makes it not entirely useless as far as the original question is concerned. – eski Jan 15 '16 at 19:06
  • 6
    @eski No, you can _not_ approximate higher order functions with only linear activation functions, you can only model linear (or affine, if you have an additional constant node in each but the last layer) functions and transformations, no matter how many layers you have. – HelloGoodbye Jan 17 '16 at 11:08
  • Is it correct to say that the activation function's main purpose is to allow the neural network to produce a non-linear decision boundary? – stackoverflowuser2010 Aug 04 '16 at 05:17
  • @stackoverflowuser2010 That would be one way to look at it. But there is more to an activation function than that. Wikipedia's article about [activation functions](https://en.wikipedia.org/wiki/Activation_function) lists several activation functions, all (but one) of which are nonlinear, and compares different qualities that an activation function can have. – HelloGoodbye Aug 04 '16 at 15:40
  • @stackoverflowuser2010 For example, sigmoid functions – logistic, tanh, arctan, softsign, etc. – are nonlinear and were used almost exclusively ten years ago, but then rectified linear units (RELUs) started to be used and were shown to be superior to sigmoid units in many cases. However, logistic units are still used in cases where you need the values to be between 0 and 1, such as in binary classifiers, or when you need "switches," such as in LSTM-layers. You also have the softmax nonlinearity, which operates on vectors and are used in classification problems that have more than two classes. – HelloGoodbye Aug 04 '16 at 16:06
  • @HelloGoodbye: Softmax is used as a loss function, not an activation function. – stackoverflowuser2010 Aug 04 '16 at 17:31
  • @stackoverflowuser2010 A loss function needs to be scalar-valued in order for gradient descent to work, but softmax is vector-valued, so it cannot be used as a loss function by itself, even though it is often used in conjunction with one (for multi-class classification, the most common loss function is probably categorical cross-entropy, but you can also use hinge loss or even square loss). I guess whether softmax [is an activation function](https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions#Softmax_Function) or not depends on how you look at it. – HelloGoodbye Aug 05 '16 at 11:40
  • @HelloGoodbye: Yes, softmax is the squashing function, and cross-entropy is the loss function. Usually they go together. – stackoverflowuser2010 Aug 05 '16 at 17:14
26

A feed-forward neural network with linear activation and any number of hidden layers is equivalent to just a linear neural network with no hidden layer. For example, let's consider the neural network in the figure below, with two hidden layers and no activation function. [Figure: input x feeding hidden layers h1 and h2, which produce output y]

y = h2 * W3 + b3 
  = (h1 * W2 + b2) * W3 + b3
  = h1 * W2 * W3 + b2 * W3 + b3 
  = (x * W1 + b1) * W2 * W3 + b2 * W3 + b3 
  = x * W1 * W2 * W3 + b1 * W2 * W3 + b2 * W3 + b3 
  = x * W' + b'

We can do the last step because a composition of several linear transformations can be replaced by a single transformation, and the combination of several bias terms is just a single bias. The outcome is the same even if we add a linear activation.

So we could replace this neural net with a single-layer neural net. This can be extended to n layers. This shows that adding layers doesn't increase the approximation power of a linear neural net at all. We need non-linear activation functions to approximate non-linear functions, and most real-world problems are highly complex and non-linear. In fact, when the activation function is non-linear, a two-layer neural network with a sufficiently large number of hidden units can be proven to be a universal function approximator.
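A minimal sketch, assuming NumPy, that verifies this algebra numerically for one randomly initialized network:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                       # a batch of 5 inputs with 3 features

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)
W3, b3 = rng.normal(size=(4, 2)), rng.normal(size=2)

# forward pass with no (identity) activation
h1 = x @ W1 + b1
h2 = h1 @ W2 + b2
y = h2 @ W3 + b3

# collapsed single-layer equivalent
W_prime = W1 @ W2 @ W3
b_prime = b1 @ W2 @ W3 + b2 @ W3 + b3

print(np.allclose(y, x @ W_prime + b_prime))      # True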

xashru
9

Several good answers are here. It is also worth pointing to the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop for deeper insight into several ML-related concepts. An excerpt from page 229 (Section 5.1):

If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is smaller than either the number of input or output units, then the transformations that the network can generate are not the most general possible linear transformations from inputs to outputs because information is lost in the dimensionality reduction at the hidden units. In Section 12.4.2, we show that networks of linear units give rise to principal component analysis. In general, however, there is little interest in multilayer networks of linear units.

desertnaut
Hari
6

"The present paper makes use of the Stone-Weierstrass Theorem and the cosine squasher of Gallant and White to establish that standard multilayer feedforward network architectures using abritrary squashing functions can approximate virtually any function of interest to any desired degree of accuracy, provided sufficently many hidden units are available." (Hornik et al., 1989, Neural Networks)

A squashing function is, for example, a nonlinear activation function that maps into [0, 1], such as the sigmoid activation function.
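For illustration, a minimal sketch, assuming NumPy, of the logistic sigmoid acting as a squashing function: arbitrarily large or small inputs are mapped into (0, 1):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-50.0, -5.0, 0.0, 5.0, 50.0])
print(sigmoid(x))    # approximately [0, 0.0067, 0.5, 0.9933, 1]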

Henrik
alfa
3

There are times when a purely linear network can give useful results. Say we have a network of three layers with shapes (3,2,3). By limiting the middle layer to only two dimensions, we get a result that is the "plane of best fit" in the original three dimensional space.

But there are easier ways to find linear transformations of this form, such as NMF, PCA, etc. However, this is a case where a multi-layered network does NOT behave the same way as a single-layer perceptron (see the sketch below).
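As a hedged illustration, assuming NumPy: the subspace such a linear (3, 2, 3) network would settle on is the same "plane of best fit" that PCA finds, so the sketch below computes it directly with the SVD instead of training a network:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X -= X.mean(axis=0)                      # center the data

U, S, Vt = np.linalg.svd(X, full_matrices=False)
plane = Vt[:2]                           # the top-2 principal directions

encoded = X @ plane.T                    # 3 -> 2, the bottleneck layer
reconstructed = encoded @ plane          # 2 -> 3, the output layer

mse = np.mean((X - reconstructed) ** 2)
print(f"mean squared reconstruction error of the best-fit plane: {mse:.4f}")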

markemus
3

Neural networks are used in pattern recognition, and pattern finding is an inherently non-linear task.

Suppose, for the sake of argument, that we use a linear activation function y = wX + b for every single neuron and set something like: if y > 0 -> class 1, else class 0.

Now we can compute our loss using the squared-error loss and backpropagate it so that the model learns well, correct?

WRONG.

  • For the last hidden layer, the updated value will be w{l} = w{l} - (alpha)*X.

  • For the second last hidden layer, the updated value will be w{l-1} = w{l-1} - (alpha)*w{l}*X.

  • For the ith last hidden layer, the updated value will be w{i} = w{i} - (alpha)*w{l}...*w{i+1}*X.

This results in us multiplying all the weight matrices together, which leads to one of the following possibilities:

  • A) w{i} barely changes due to a vanishing gradient
  • B) w{i} changes dramatically and inaccurately due to an exploding gradient
  • C) w{i} changes well enough to give us a good fit score

If case C happens, that means our classification/prediction problem was most probably a simple linear/logistic-regression problem and never required a neural network in the first place!

No matter how robust or well hyper-tuned your NN is, if you use a linear activation function you will never be able to tackle pattern-recognition problems that require non-linearity.
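As a rough illustration of the gradient-product issue (not of the exact update rules above), here is a minimal sketch, assuming NumPy, showing how the norm of a product of many random weight matrices either collapses or blows up depending on their scale:

import numpy as np

rng = np.random.default_rng(0)

def product_norm(scale, depth=50, size=10):
    prod = np.eye(size)
    for _ in range(depth):
        prod = prod @ (scale * rng.normal(size=(size, size)) / np.sqrt(size))
    return np.linalg.norm(prod)

print("small weights:", product_norm(scale=0.5))   # shrinks toward 0 (vanishing)
print("large weights:", product_norm(scale=2.0))   # grows enormous (exploding)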

desertnaut
Kaustubh J
3

It is important to use a nonlinear activation function in neural networks, especially in deep NNs and backpropagation. Following the question posed in the topic, I will first give the reason why a nonlinear activation function is needed for backpropagation.

Simply put: if a linear activation function is used, the derivative of the cost function is constant with respect to (w.r.t.) the input, so the value of the input (to the neurons) does not affect the updating of the weights. This means that we cannot figure out which weights are most effective at creating a good result, and therefore we are forced to change all weights equally.

Deeper: In general, weights are updated as follows:

W_new = W_old - Learn_rate * D_loss

This means that the new weight is equal to the old weight minus the learning rate times the derivative of the cost function. If the activation function is a linear function, then its derivative w.r.t. the input is a constant, and the input values have no direct effect on the weight update.

For example, we intend to update the weights of the last layer's neurons using backpropagation. We need to calculate the gradient of the loss function w.r.t. the weight. By the chain rule we have:

[equation image: chain-rule expansion of the gradient of the loss w.r.t. the weight]

h and y are the (estimated) neuron output and the actual output value, respectively, x is the input of the neuron, and grad(f) is the derivative of the activation function w.r.t. its input. The value calculated above (multiplied by the learning rate) is subtracted from the current weight to obtain the new weight. We can now compare these two types of activation functions more clearly.

1- If the activation function is a linear function, such as F(x) = 2 * x,

then:

[equation image: the gradient expression with the linear activation's constant derivative substituted in]

the new weight will be:

[equation image: the resulting weight-update rule for the linear activation]

As you can see, all the weights are updated equally and it does not matter what the input value is!!

2- But if we use a non-linear activation function like Tanh(x) then:

[equation image: the derivative of Tanh(x)]

and:

[equation image: the resulting weight-update rule with the Tanh activation]

and now we can see the direct effect of the input on the weight update: different input values produce different weight changes.
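A minimal sketch, assuming NumPy and a simplified single-neuron setting (h = tanh(w * x) with squared-error loss, not the exact equations pictured above), showing that the gradient, and hence the weight update, changes with the input:

import numpy as np

def grad_w(w, x, y):
    # chain rule: dL/dw = dL/dh * dh/dz * dz/dw
    h = np.tanh(w * x)
    return 2.0 * (h - y) * (1.0 - np.tanh(w * x) ** 2) * x

w, y = 0.5, 1.0
for x in [0.1, 1.0, 3.0]:
    print(f"x = {x}: dL/dw = {grad_w(w, x, y):.4f}")   # a different update for each input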

I think the above is enough to answer the question of the topic, but it is useful to mention other benefits of using a non-linear activation function.

As mentioned in other answers, non-linearity enables NNs to have more hidden layers and to go deeper. A sequence of layers with a linear activation function can be merged into a single layer (a composition of the previous functions), so it is effectively a neural network with one hidden layer and does not take advantage of the benefits of deep NNs.

A non-linear activation function can also produce a normalized output.

Mohammad Javad
  • Hi Mohammed, I believe your answer is incorrect. It is not true that when using a linear activation function "all the weights are updated equally and it does not matter what the input value is!!". Consider the single layer single neuron neural net with 1D input x. Suppose for simplicity that as a loss function we minimise the output of the net. The gradient (or just derivative) w.r.t. the weights would be equal to x * df / dz, where f is the linear activation function f(z) = z. As you can see, the model *would* be able to adjust the weight according to the input x. – Mr. President Nov 23 '20 at 14:33
  • Mohammed, if you were correct, then a linear Perceptron would not be able to tell different classes in linearly separable spaces, and that is simply untrue. If you want, you can use Tensorflow online (http://playground.tensorflow.org/) to build a linear Perceptron and check that. – Humberto Fioravante Ferro Aug 12 '21 at 12:33
  • What is the consequence if the activation function is a constant? Thank you. – Sophia Sep 14 '22 at 19:48
2

To understand the logic behind non-linear activation functions, first you should understand why activation functions are used at all. In general, real-world problems require non-linear solutions, which are not trivial. So we need some functions to generate that non-linearity. Basically, what an activation function does is generate this non-linearity while mapping input values into a desired range.

However, linear activation functions could be used in the very limited set of cases where you do not need hidden layers, such as linear regression. Usually it is pointless to build a neural network for this kind of problem, because independent of the number of hidden layers, the network will produce a linear combination of the inputs, which can be computed in a single step. In other words, it behaves like a single layer.

There are also a few more desirable properties for activation functions, such as continuous differentiability. Since we are using backpropagation, the function we use must be differentiable at any point. I strongly advise you to check the Wikipedia page on activation functions to get a better understanding of the topic.

Safak Ozdek
2

As I remember, sigmoid functions are used because their derivative, which the backpropagation algorithm needs, is easy to calculate: something simple like f(x)(1 - f(x)). I don't remember the math exactly. Actually, any differentiable function can be used.
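A minimal numeric check, assuming NumPy, of that identity for the logistic sigmoid: the analytic derivative f(x)(1 - f(x)) matches a central finite difference:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
analytic = sigmoid(x) * (1.0 - sigmoid(x))

eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # central difference

print(np.allclose(analytic, numeric, atol=1e-8))   # True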

Atilla Ozgur
Anton
  • 8
    The function still wants to be monotonically increasing, as I recall. So, not *any* function. – Novak Mar 20 '12 at 19:01
1

A layered NN of several neurons can be used to learn linearly inseparable problems. For example, the XOR function can be obtained with two layers using a step activation function, as in the sketch below.
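A minimal sketch (plain Python, with hand-picked rather than learned weights) of such a two-layer step-activation network computing XOR:

def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h_or = step(x1 + x2 - 0.5)         # hidden unit 1 acts as OR
    h_and = step(x1 + x2 - 1.5)        # hidden unit 2 acts as AND
    return step(h_or - h_and - 0.5)    # output: OR and not AND, i.e. XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))   # 0, 1, 1, 0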

david
-4

It's not at all a requirement. In fact, the rectified linear activation function is very useful in large neural networks. Computing the gradient is much faster, and it induces sparsity by setting a minimum bound at 0.

See the following for more details: https://www.academia.edu/7826776/Mathematical_Intuition_for_Performance_of_Rectified_Linear_Unit_in_Deep_Neural_Networks
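As a hedged illustration, assuming NumPy, of the two properties mentioned: the ReLU's gradient is trivially cheap (0 or 1), and clamping negatives to zero makes roughly half of random pre-activations exactly zero, i.e. a sparse output:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)       # subgradient: 1 for x > 0, else 0

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=1000)

out = relu(pre_activations)
print("fraction of exactly-zero outputs:", np.mean(out == 0.0))                     # about 0.5
print("gradient values used in backprop:", np.unique(relu_grad(pre_activations)))   # [0. 1.]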


Edit:

There has been some discussion over whether the rectified linear activation function can be called a linear function.

Yes, it is technically a nonlinear function because it is not linear at the point x = 0; however, it is still correct to say that it is linear at all other points, so I don't think it's that useful to nitpick here.

I could have chosen the identity function and it would still be true, but I chose ReLU as an example because of its recent popularity.

eski
  • 10
    The rectified linear activation function is also non-linear (despite its name). It is just linear for positive values – Plankalkül Aug 21 '15 at 09:08
  • 4
    You're technically correct, it's not linear across the entire domain, specifically at x=0 (it is linear for x < 0 actually, since f(x) = 0 is a linear function). It's also not differentiable so the gradient function isn't fully computable either, but in practice these technicalities are easy to overcome. – eski Aug 21 '15 at 17:00
  • 4
    He's not only technically correct, he's also right in practice (or something like that). It is the non-linearity of the ReLU that make them useful. If they would have been linear, they would have had an activation function on the form `f(x) = a*x` (because that is the only type of linear activation function there is), which is _useless_ as an activation function (unless you combine it with non-linear activation functions). – HelloGoodbye Jan 15 '16 at 17:11
  • @HelloGoodbye A function that is linear across its entire domain (like your example) isn't useless as an activation function. It can still be used to model complex functions and patterns, it might not be the best choice when comparing it to ReLU in certain situations, but that doesn't make it useless. – eski Jan 15 '16 at 17:48
  • 2
    What do you mean by "complex functions and patterns"? If you only have linear activation functions, the entire network can only model linear transformations between input and output. And since you can model any linear transformation you want with only a direct connection between input and output layers, your entire network will not become any better than a network with no hidden layers in it, no matter how many hidden layers you use. – HelloGoodbye Jan 16 '16 at 03:25
  • As a comment on the edit to the answer, the non-linearity at the point x=0 in the ReLU is significant, and it makes the ReLU non-linear. This is not nit-picking; this non-linearity is a requirement, which the identity function simply does not have. You can prove that with the identity function, you can _only_ model first order functions, not functions of any higher order than that. – HelloGoodbye Jan 17 '16 at 11:17
  • Are several linear-activation-function-based units connected to each other redundant? – Francisco Vargas Feb 05 '16 at 17:36
  • Are you asking about strictly linear functions or also piecewise linear functions? Strictly linear functions are redundant after the learning is finished, although I'm not so sure they are useless during the learning phase and haven't seen evidence to support that they are. – eski Feb 05 '16 at 17:58
  • 13
    Rectified Linear Unit (ReLU) is not linear, and it's not just a "minor detail" that people are nitpicking, it's a significant important reason of why it is useful to begin with. A neural network with the identity matrix or a regular linear unit used as the activation function would not be able to model non linear functions. Just because it's linear above 0 doesn't mean it's practically a linear function. A leaky ReLU is "linear" below 0 as well but it's still not a linear function and definitely can't just be replaced by the identity function. Nonlinearity is most definitely a requirement. – Essam Al-Mansouri Mar 03 '16 at 07:02
  • 3
    It's actually a concept called a piecewise linear function. – eski Mar 03 '16 at 13:09
  • 2
    Piecewise linear is still non-linear. It fails to satisfy both the properties of a linear function: f(kx) = k*f(x), when you choose negative x and k, and f(x+y) = f(x) + f(y), when you choose only one of x and y to be negative and the other to be positive. – rationalis Nov 01 '16 at 05:37
  • @rationalis Yes, that's been explained several times now. – eski Nov 02 '16 at 16:49
  • "Yes, it is technically a nonlinear function because it is not linear at the point x=0", I think what you mean is; it is discontinuous at x=0. – ifyalciner Dec 06 '17 at 19:37
  • Has anyone ever done an experiment to test this? If yes, can he share the link to the results? (favourably for well-known architectures such as ResNet on some dataset other than MNIST) – mcExchange Mar 19 '20 at 10:35