2

I have a big dataset composed of 18,260 input fields and 4 outputs. I am using Keras and TensorFlow to build a neural network that can predict the correct output.

However, I have tried many solutions and the accuracy does not get above 55% unless I use the sigmoid activation function in all model layers except the first one, as below:

from keras.models import Sequential
from keras.layers import Dense

def baseline_model(optimizer='adam', init='random_uniform'):
    # create model
    model = Sequential()
    model.add(Dense(40, input_dim=18260, activation="relu", kernel_initializer=init))
    model.add(Dense(40, activation="sigmoid", kernel_initializer=init))
    model.add(Dense(40, activation="sigmoid", kernel_initializer=init))
    model.add(Dense(10, activation="sigmoid", kernel_initializer=init))
    model.add(Dense(4, activation="sigmoid", kernel_initializer=init))
    model.summary()
    # Compile model
    model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

Is using sigmoid for the activation correct in all layers? The accuracy reaches 99.9% when using sigmoid as shown above, so I was wondering whether there is something wrong in the model implementation.

Milo Lu
Ahmad Hijazi
  • Doesn't look like a serious question but like someone trying to get their homework done... I don't know Keras, but neural networks are not magic: you need to parametrize them in order to get good results, and activation functions are just another parameter (for what it's worth in this case). The question would be: if you have a NN that gets 99.9% accuracy, why do you want to use a different function for the same dataset? This is what a sigmoid looks like: https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/320px-Logistic-curve.svg.png – DGoiko Nov 30 '18 at 09:20
  • Thanks for your answer. But you can check the answers below to see that this is not someone trying to get their homework done :) – Ahmad Hijazi Nov 30 '18 at 09:30
  • Still looks like homework to me. It's the typical thinking question of: why would you use X given that Y performs in Z way? You seem to be new to artificial intelligence, and it seems to be an OK question to start making you think about the way AI works, and it's not good that you try to skip it by asking it here :(. Now, to answer your question, a neural network is just a mathematical function which heavily depends on activation functions. Using activation functions such as sigmoid prevents the neural network from producing values so high that it would become impossible (...) – DGoiko Nov 30 '18 at 09:34
  • to learn properly, because every neuron would be getting too-high values as input and activating everything. This is known as the exploding gradient problem (we work on algorithms to substitute backpropagation that don't suffer from this problem). Now, if you still want to use an unbounded function like ReLU, your learning process has to take this into consideration and try to avoid it (see the sketch after these comments). Rule of thumb for newbies: use batch training, it makes almost every result better with zero previous knowledge. – DGoiko Nov 30 '18 at 09:36
  • Keep in mind that backpropagation works with derivatives it has to calculate, so the gradient propagation is much cheaper using functions like ReLU. As I said before, NNs are not toys, and there's no universal recipe or response with so little serious data, and anyone claiming to give you a serious answer is pretty much a big mouth. You need to read and understand what you're doing, and that is why I thought this was homework: this is the kind of easy question a student would ask without knowing what they are doing wrong xD – DGoiko Nov 30 '18 at 09:38
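To make the bounded vs. unbounded point from these comments concrete, here is a minimal NumPy sketch (the input values are made up for illustration):

import numpy as np

def sigmoid(x):
    # squashes any real input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # unbounded above: large positive inputs pass through unchanged
    return np.maximum(0.0, x)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(sigmoid(x))  # [~0.0, 0.269, 0.5, 0.731, ~1.0] -- always in (0, 1)
print(relu(x))     # [0., 0., 0., 1., 100.] -- can grow without bound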

3 Answers

6

The sigmoid might work, but I suggest using relu activation for the hidden layers. The problem is that your output layer's activation is sigmoid, while it should be softmax (because you are using the sparse_categorical_crossentropy loss).

model.add(Dense(4, activation="softmax", kernel_initializer=init))
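
For reference, a minimal sketch of the suggested setup, reusing the layer sizes and optimizer from the question (the function name suggested_model and the exact layer widths are only illustrative, not tuned recommendations):

from keras.models import Sequential
from keras.layers import Dense

def suggested_model(optimizer='adam', init='random_uniform'):
    model = Sequential()
    # relu for the hidden layers
    model.add(Dense(40, input_dim=18260, activation="relu", kernel_initializer=init))
    model.add(Dense(40, activation="relu", kernel_initializer=init))
    model.add(Dense(10, activation="relu", kernel_initializer=init))
    # softmax on the 4-unit output layer, so the outputs form a probability distribution
    model.add(Dense(4, activation="softmax", kernel_initializer=init))
    # sparse_categorical_crossentropy expects integer class labels (0..3)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=optimizer, metrics=['accuracy'])
    return model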

Edit after discussion in the comments

Your outputs are integers (class labels). The sigmoid logistic function outputs values in the range (0, 1). The outputs of softmax are also in the range (0, 1), but the softmax function adds one more constraint: the outputs must sum to 1. Therefore the outputs of softmax can be interpreted as the probability that the input belongs to each class.

E.g.

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def softmax(a):
    return np.exp(a - max(a)) / np.sum(np.exp(a - max(a)))

a = np.array([0.6, 10, -5, 4, 7])
print(sigmoid(a))
# [0.64565631, 0.9999546 , 0.00669285, 0.98201379, 0.99908895]
print(softmax(a))
# [7.86089760e-05, 9.50255231e-01, 2.90685280e-07, 2.35544722e-03, 4.73104222e-02]
print(sum(softmax(a)))
# 1.0
Mitiku
  • Thanks for answering. But if using sigmoid results in 99% accuracy, then why should I use softmax? Another question: if I reduce the number of hidden layers and remove 2 of them, does using sigmoid as in the question become correct? – Ahmad Hijazi Nov 30 '18 at 08:45
  • I assume your output labels are class labels out of four classes. Am I right? – Mitiku Nov 30 '18 at 08:52
  • Yes, the output labels are 4 integers (1, 2, 3 or 4). However, I tried using softmax now and it works fine, thanks. Why did you suggest using relu? When using relu I am not getting accuracy over 55%! – Ahmad Hijazi Nov 30 '18 at 08:53
  • ReLU often tends to work better than sigmoid for hidden layers, because when the input to the activation function is very high or very low (a large negative number), the derivative of the sigmoid function becomes close to zero, which hinders the learning of the model (see the sketch below). But if in your case sigmoid works better, then it is fine. – Mitiku Nov 30 '18 at 08:56
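To illustrate that saturation point, a minimal NumPy sketch comparing the sigmoid derivative with the relu derivative on some made-up inputs:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-20.0, -2.0, 0.0, 2.0, 20.0])

# derivative of sigmoid is s(x) * (1 - s(x)): nearly zero once |x| is large
s = sigmoid(x)
print(s * (1 - s))            # [~2e-9, 0.105, 0.25, 0.105, ~2e-9]

# derivative of relu is 1 for positive inputs, 0 otherwise: no saturation for x > 0
print((x > 0).astype(float))  # [0., 0., 0., 1., 1.]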
0

You have to use one activation or another, as activations are what bring non-linearity into the model. If the model doesn't have any non-linear activation, it basically behaves like a single-layer network. Read more about why to use activations here. You can check various activations here.

It also seems like your model is overfitting when using sigmoid, so try techniques to overcome that, like creating train/dev/test sets, reducing the complexity of the model, dropout, etc.
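
As a rough sketch of those countermeasures (the dropout rate, layer sizes and validation split below are arbitrary example values, not recommendations):

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(40, input_dim=18260, activation="relu"))
model.add(Dropout(0.5))            # randomly drops 50% of the units during training
model.add(Dense(10, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(4, activation="softmax"))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Hold out part of the data so train vs. validation accuracy can be compared;
# a large gap between the two is a sign of overfitting.
# X and y stand in for the 18260-feature inputs and integer labels from the question.
# model.fit(X, y, epochs=20, batch_size=64, validation_split=0.2)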

parthagar
  • Thanks for the answer, but I am still a bit confused regarding my question. Is using sigmoid as shown above something wrong? I know there are a lot more activation functions, but I want to use sigmoid for all layers except the first, so I need to know whether this is right or not. – Ahmad Hijazi Nov 30 '18 at 08:38
  • You have the liberty to use any activation you want in the hidden layers. But you should use 'softmax' at the output layer as you are classifying into 4 classes. Also you should use 'categorical_crossentropy' as the loss. Read more here https://jovianlin.io/cat-crossentropy-vs-sparse-cat-crossentropy/. Just remember to check for overfitting. – parthagar Nov 30 '18 at 08:46
  • The outputs are integers (1, 2, 3 and 4), so as per your suggested link I should use sparse_categorical_crossentropy – Ahmad Hijazi Nov 30 '18 at 08:49
  • It depends on what your labels are. If they are something like [2, 3, 1, 0], use sparse_categorical_crossentropy; if they are like [[0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0]], use categorical_crossentropy (see the sketch after these comments). Refer to (https://stackoverflow.com/questions/37312421/whats-the-difference-between-sparse-softmax-cross-entropy-with-logits-and-softm), (https://www.dlology.com/blog/how-to-use-keras-sparse_categorical_crossentropy/). – parthagar Nov 30 '18 at 08:57
  • Yupp, so you would use sparse_categorical_crossentropy. – parthagar Nov 30 '18 at 09:07
  • Is it really? 'Linear' activation is applied when none is specified, and I think you should read a bit more about 'linear' activation: when you stack layers with linear outputs, they act as a single linear function. How did you come to the conclusion that no output is given? That is totally wrong. – parthagar Nov 30 '18 at 17:35
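A small sketch of the two label formats discussed above, and how to convert between them with Keras' to_categorical (the class values are made up for illustration):

import numpy as np
from keras.utils import to_categorical

# integer labels -> use sparse_categorical_crossentropy
y_sparse = np.array([2, 3, 1, 0])

# one-hot labels -> use categorical_crossentropy
y_onehot = to_categorical(y_sparse, num_classes=4)
print(y_onehot)
# [[0. 0. 1. 0.]
#  [0. 0. 0. 1.]
#  [0. 1. 0. 0.]
#  [1. 0. 0. 0.]]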
-1

Neural networks require non-linearity at each layer to work. Without non-linear activations, no matter how many layers you have, you could write the same thing with only one layer.

Linear functions are limited in complexity: if g and f are linear functions, g(f(x)) can be written as z(x), where z is also a linear function. It is pointless to stack them without adding non-linearity.

And that's why we use non-linear activation functions. sigmoid(g(f(x))) cannot be written as a linear function.
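
A minimal NumPy sketch of that argument: two stacked linear layers collapse into a single linear map, while putting a sigmoid in between does not (the weights below are random illustrative values):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=5)

# two linear layers: f(x) = W1 x + b1, then g(h) = W2 h + b2
two_layers = W2 @ (W1 @ x + b1) + b2

# the same map as a single linear layer z(x) = (W2 W1) x + (W2 b1 + b2)
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True -- stacking linear layers adds nothing

# with a sigmoid in between, the composition is no longer linear,
# so it cannot be rewritten as a single W x + b
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
nonlinear = W2 @ sigmoid(W1 @ x + b1) + b2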

  • Thanks for answering. So, as a result, should I either use sigmoid in one layer only, or modify the activation functions in all layers so as not to keep sigmoid in all of them? – Ahmad Hijazi Nov 30 '18 at 08:40
  • I cannot advise you which activation to use; try them out and see which one works better. You could use different activation functions for each layer if you want to. Just use one activation per layer. – Mete Han Kahraman Nov 30 '18 at 08:47
  • You got it wrong, buddy. You cannot use two CONTIGUOUS linear layers (your explanation about that is OK). There's no harm in using them, as long as you stack them between non-linear functions. PS: you got the terms wrong. Linear functions are also activation functions (the identity is, after all, an activation function, just to give you an example). NNs are not restricted to the same activation function for a whole layer; it's just a common recipe that simplifies implementation and solution-thinking. – DGoiko Nov 30 '18 at 09:24
  • You are right, an activation function can be linear (you still want to use non-linear ones for NNs). I'll edit the answer. – Mete Han Kahraman Nov 30 '18 at 09:56