Is the activation function only used to introduce non-linearity, or for both problems? I am still confused about why we need an activation function and how it helps.
- Possible duplicate of [Why must a nonlinear activation function be used in a backpropagation neural network?](https://stackoverflow.com/questions/9782071/why-must-a-nonlinear-activation-function-be-used-in-a-backpropagation-neural-net) – dennlinger Aug 10 '18 at 05:48
1 Answer
Generally, such a question would be better suited for the Stats Stackexchange or the Data Science Stackexchange, since it is a purely theoretical question and not directly related to programming (which is what Stackoverflow is for).
Anyway, I am assuming that you are referring to the classes of linearly separable and non-linearly separable problems when you talk about "both problems". In fact, a non-linear activation function is always used, no matter which kind of problem you are trying to solve with a neural network. The reason for using non-linearities as activation functions is the following:
Every layer in the network consists of a sequence of linear operations, plus the non-linearity.
Formally - and this is something you might have seen before - you can express the mathematical operation of a single layer F and its input h as:
F(h) = Wh + b
where W is a weight matrix and b a bias vector. This operation is purely sequential, and for a simple multi-layer perceptron (with n layers and without non-linearities), we can write the calculation as follows:
y = F_n(F_{n-1}(F_{n-2}(...(F_1(x))...)))
which is equivalent to
y = W_n W_{n-1} ... W_1 x + W_n W_{n-1} ... W_2 b_1 + W_n W_{n-1} ... W_3 b_2 + ... + b_n
Specifically, we note that this is nothing but a chain of matrix multiplications and additions; in particular, we can aggregate the weight products into one uber-matrix W_p and all the bias terms into a single bias b_p, to rewrite it in a single formula:
y = W_p x + b_p
This has the same expressive power as the multi-layer perceptron above, but can inherently be modeled by a single layer (while having far fewer parameters than before).
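
To make this concrete, here is a minimal numpy sketch (the layer sizes and values are made up purely for illustration) showing that two stacked linear layers are reproduced exactly by a single collapsed layer W_p, b_p:

```
import numpy as np

rng = np.random.default_rng(0)

# Two purely linear layers with made-up sizes: 4 -> 3 -> 2
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=4)

# Applying the layers one after another ...
y_stacked = W2 @ (W1 @ x + b1) + b2

# ... is the same as applying a single collapsed affine map W_p x + b_p
W_p = W2 @ W1
b_p = W2 @ b1 + b2
y_single = W_p @ x + b_p

print(np.allclose(y_stacked, y_single))  # True
```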
Introducing a non-linearity g into this equation turns the simple "building block" F(h) into:
F(h) = g(Wh + b)
Now, collapsing a sequence of layers into a single one is no longer possible, and the non-linearity additionally allows us to approximate arbitrary functions.
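
As a quick sanity check (again only a sketch with made-up sizes), one can verify numerically that inserting a non-linearity such as ReLU breaks the affine structure: for any purely affine map f, the expression f(x1 + x2) - f(x1) - f(x2) + f(0) is exactly zero, but with the non-linearity in place it generally is not:

```
import numpy as np

rng = np.random.default_rng(1)

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

relu = lambda z: np.maximum(z, 0.0)

def f(x):
    # Two layers with a ReLU non-linearity in between
    return W2 @ relu(W1 @ x + b1) + b2

x1, x2 = rng.normal(size=4), rng.normal(size=4)
zero = np.zeros(4)

# For an affine map g(x) = Wx + b this is exactly zero; with the ReLU
# in place it generally is not, so f cannot be collapsed into a single
# linear layer.
print(f(x1 + x2) - f(x1) - f(x2) + f(zero))
```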
EDIT: To address your other concern ("how does it help?"), I should explicitly mention that not every problem is linearly separable, and such problems cannot be solved by a purely linear network (i.e. one without non-linearities). One classic simple example is the XOR operator.
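
For illustration, here is one possible hand-crafted 2-2-1 ReLU network that computes XOR exactly; the specific weights below are just one construction I picked for this sketch, not anything canonical:

```
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Hand-picked weights for a tiny 2-2-1 ReLU network that computes XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

def xor_net(x):
    return W2 @ relu(W1 @ x + b1) + b2

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", xor_net(np.array(x, dtype=float)))
# Prints 0, 1, 1, 0 - something no single linear layer can do,
# because XOR is not linearly separable.
```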
