I think the answer would be yes, but I'm unable to reason out a good explanation on this.
- Yes. In very caricatural terms: Linear problem = easy problem. Linear tool = simple but not powerful tool. Non-linear problem = hard problem. Non-linear tool = complicated but powerful tool. Can I use my complicated and powerful tool to solve an easy problem? Yes, you can. – Stef Jan 07 '22 at 19:41
- @Stef, I was thinking along exactly the same lines – vasu Jan 07 '22 at 21:35
2 Answers
The mathematical argument lies in the power to represent linearity; we can use the following three lemmas to show it:
Lemma 1
With affine transformations (a linear layer) we can map the input hypercube [0,1]^d into an arbitrarily small box [a,b]^k. The proof is quite simple: we just make all the biases equal to a, and scale the weights by (b-a).
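A minimal NumPy sketch of this lemma (the dimensions d, k and the box [a, b] below are just illustrative choices, nothing canonical):

```python
import numpy as np

d, k = 3, 4           # input and output dimensions (illustrative)
a, b = -0.01, 0.01    # the tiny target box [a, b]^k

# Affine layer z = Wx + bias: each row of W sums to (b - a) and every bias is a,
# so any x in [0, 1]^d is mapped into [a, b]^k.
W = np.full((k, d), (b - a) / d)
bias = np.full(k, a)

x = np.random.rand(d)               # an arbitrary point of the unit hypercube
z = W @ x + bias
print(z.min() >= a, z.max() <= b)   # True True
```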
Lemma 2
For a sufficiently small scale, many non-linearities are approximately linear. This is essentially the definition of a derivative, or of a Taylor expansion. In particular, take relu(x): for x > 0 it is, in fact, linear! What about the sigmoid? Well, if we look at a tiny region [-eps, eps], you can see that it approaches a linear function as eps -> 0!
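A small sketch of this, using the sigmoid's tangent line at 0 (which is 0.5 + x/4) as the linear comparison; the eps values are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tangent line of the sigmoid at 0: sigmoid(x) is close to 0.5 + x/4 for small x.
for eps in (1.0, 0.1, 0.01):
    x = np.linspace(-eps, eps, 1001)
    max_err = np.abs(sigmoid(x) - (0.5 + x / 4.0)).max()
    print(f"eps = {eps:5}: max deviation from the tangent line = {max_err:.1e}")
```

The deviation shrinks rapidly as eps shrinks, which is exactly the "approximately linear on a small scale" claim.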
Lemma 3
Composition of affine functions is affine. In other words, if I were to make a neural network with multiple linear layers, it would be equivalent to having just one. This follows from the matrix composition rules:
W2(W1x + b1) + b2 = W2W1x + W2b1 + b2 = (W2W1)x + (W2b1 + b2)

where (W2W1) is the new weight matrix and (W2b1 + b2) is the new bias.
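A quick numerical check of this identity with random (purely illustrative) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_linear_layers = W2 @ (W1 @ x + b1) + b2
collapsed_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)        # new weights, new bias
print(np.allclose(two_linear_layers, collapsed_layer))  # True
```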
Combining the above
Composing the three lemmas above, we see that with a non-linear layer there always exists an arbitrarily good approximation of a linear function! We simply use the first layer to map the entire input space into the tiny part of the pre-activation space where the non-linearity is approximately linear, and then we "map it back" in the following layer.
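A constructive sketch of that argument for the sigmoid, assuming an illustrative target f(x) = 3x + 1 and a hand-picked squeezing factor eps (these are just example choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical target: the linear function f(x) = 3x + 1 on [0, 1].
target = lambda x: 3.0 * x + 1.0

eps = 1e-3            # Lemma 1: squeeze the input into the tiny box [-eps, eps]
w1, b1 = eps, 0.0     # first layer (affine)
w2 = 3.0 * 4.0 / eps  # second layer "maps it back": undo the 1/4 slope and the squeeze
b2 = 1.0 - w2 * 0.5   # cancel sigmoid(0) = 0.5 and add the intercept

x = np.linspace(0.0, 1.0, 1001)
y_hat = w2 * sigmoid(w1 * x + b1) + b2
print(np.abs(y_hat - target(x)).max())   # very small approximation error
```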
General case
This is a very simple proof; in the general case you can use the Universal Approximation Theorem to show that a sufficiently large non-linear neural network (sigmoid, ReLU, many others) can approximate any continuous target function, which includes linear ones. That proof (originally given by Cybenko) is, however, much more complex and relies on showing that specific classes of functions are dense in the space of continuous functions.

Technically, yes.
The reason you could use a non-linear activation function for this task is that you can manually alter the results. Let's say the activation function outputs values in the range 0.0-1.0; then you can round up or down to get a binary 0/1. Just to be clear, rounding up or down isn't a linear activation, but for this specific question the purpose of the network was classification, where some kind of rounding has to be applied.
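As an illustration of that rounding step (the numbers are made up), thresholding a sigmoid output looks roughly like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([-2.0, -0.3, 0.1, 1.5])  # hypothetical raw outputs of the last layer
probs = sigmoid(logits)                    # non-linear activation, values in (0, 1)
labels = (probs >= 0.5).astype(int)        # "round" to a binary 0/1 class label
print(probs.round(3), labels)              # [0.119 0.426 0.525 0.818] [0 0 1 1]
```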
The reason you shouldn't is the same reason you shouldn't attach an industrial heater to a fan and call it a hair-dryer: it's unnecessarily powerful, and it could waste resources and time.
I hope this answer helped, have a good day!

- @lejlot I meant it more in terms of their specific use: vasu said they were going to use the network for classification, in which you do need some kind of rounding; rounding to a 1 or 0 was just an example. Sorry, I should've said that more clearly in my answer. – James Barnett Jan 09 '22 at 12:10
- @lejlot I have updated my answer to be more specific, I hope that helps! – James Barnett Jan 09 '22 at 12:12
- It still does not explain why non-linear activations can be used. I believe for some reason you focused on the "output" neuron, while in neural networks the activation function usually refers to internal nodes. For a non-linear activation function to be "ok" to use in a linear problem, one needs to show it can effectively represent **identity**, not rounding. – lejlot Jan 09 '22 at 12:44