Is the activation function only used to introduce non-linearity, or for both problems? I am still confused about why we need an activation function and how it helps.
- Possible duplicate of [Why must a nonlinear activation function be used in a backpropagation neural network?](https://stackoverflow.com/questions/9782071/why-must-a-nonlinear-activation-function-be-used-in-a-backpropagation-neural-net) – dennlinger Aug 10 '18 at 05:48
1 Answer
Generally, such a question would be better suited for the Stats Stackexchange or the Data Science Stackexchange, since it is a purely theoretical question and not directly related to programming (which is what Stackoverflow is for).
Anyway, I am assuming that you are referring to the classes of linearly separable and non-linearly separable problems when you talk about "both problems". In fact, a non-linear activation function is always used, no matter which kind of problem you are trying to solve with a neural network. The reason for using non-linearities as activation functions is the following:
Every layer in the network consists of a sequence of linear operations, plus the non-linearity.
Formally - and this is something you might have seen before - you can express the mathematical operation of a single layer F and its input h as:
F(h) = Wh + b
where W is a weight matrix and b a bias vector. This operation is purely sequential, and for a simple multi-layer perceptron (with n layers and without non-linearities), we can write the calculation as follows:
y = F_n(F_{n-1}(F_{n-2}(...(F_1(x))...)))
which is equivalent to
y = W_n W_{n-1} ... W_1 x + W_n W_{n-1} ... W_2 b_1 + W_n W_{n-1} ... W_3 b_2 + ... + b_n
Specifically, we note that this is nothing but a chain of matrix multiplications and additions; in particular, we can aggregate the weight products into one uber-matrix W_p and all the bias terms into a single bias b_p, to rewrite it in a single formula:
y = W_p x + b_p
This has the same expressive power as the multi-layer perceptron above, but can inherently be modeled by a single layer (while having far fewer parameters than before).
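
To make this concrete, here is a minimal numpy sketch (the layer sizes and values are made up purely for illustration) showing that two stacked linear layers are reproduced exactly by a single collapsed layer W_p, b_p:

```
import numpy as np

rng = np.random.default_rng(0)

# Two purely linear layers with made-up sizes: 4 -> 3 -> 2
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=4)

# Applying the layers one after another ...
y_stacked = W2 @ (W1 @ x + b1) + b2

# ... is the same as applying a single collapsed affine map W_p x + b_p
W_p = W2 @ W1
b_p = W2 @ b1 + b2
y_single = W_p @ x + b_p

print(np.allclose(y_stacked, y_single))  # True
```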
Introducing a non-linearity g into this equation turns the simple "building block" F(h) into:
F(h) = g(Wh + b)
Now, collapsing a sequence of layers into a single one is no longer possible, and the non-linearity additionally allows us to approximate arbitrary functions.
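
As a quick sanity check (again only a sketch with made-up sizes), one can verify numerically that inserting a non-linearity such as ReLU breaks the affine structure: for any purely affine map f, the expression f(x1 + x2) - f(x1) - f(x2) + f(0) is exactly zero, but with the non-linearity in place it generally is not:

```
import numpy as np

rng = np.random.default_rng(1)

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

relu = lambda z: np.maximum(z, 0.0)

def f(x):
    # Two layers with a ReLU non-linearity in between
    return W2 @ relu(W1 @ x + b1) + b2

x1, x2 = rng.normal(size=4), rng.normal(size=4)
zero = np.zeros(4)

# For an affine map g(x) = Wx + b this is exactly zero; with the ReLU
# in place it generally is not, so f cannot be collapsed into a single
# linear layer.
print(f(x1 + x2) - f(x1) - f(x2) + f(zero))
```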
EDIT: To address your other concern ("how does it help?"), I should explicitly mention that not every problem is linearly separable, and such problems cannot be solved by a purely linear network (i.e. one without non-linearities). One classic simple example is the XOR operator.
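
For illustration, here is one possible hand-crafted 2-2-1 ReLU network that computes XOR exactly; the specific weights below are just one construction I picked for this sketch, not anything canonical:

```
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Hand-picked weights for a tiny 2-2-1 ReLU network that computes XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

def xor_net(x):
    return W2 @ relu(W1 @ x + b1) + b2

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", xor_net(np.array(x, dtype=float)))
# Prints 0, 1, 1, 0 - something no single linear layer can do,
# because XOR is not linearly separable.
```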
