How to correct unstable loss and accuracy during training? (binary classification)

Question

I am currently working on a small binary classification project using the new keras API in tensorflow. The problem is a simplified version of the Higgs Boson challenge posted on Kaggle.com a few years back. The dataset shape is 2000x14, where the first 13 elements of each row form the input vector, and the 14th element is the corresponding label. Here is a sample of said dataset:

86.043,52.881,61.231,95.475,0.273,77.169,-0.015,1.856,32.636,202.068, 2.432,-0.419,0.0,0
138.149,69.197,58.607,129.848,0.941,120.276,3.811,1.886,71.435,384.916,2.447,1.408,0.0,1
137.457,3.018,74.670,81.705,5.954,775.772,-8.854,2.625,1.942,157.231,1.193,0.873,0.824,1

I am relatively new to machine learning and tensorflow, but I am familiar with the higher level concepts such as loss functions, optimizers and activation functions. I have tried building various models inspired by examples of binary classification problems found online, but I am having difficulties with training the model. During training, the loss sometimes increases within the same epoch, leading to unstable learning. The accuracy hits a plateau around 70%. I have tried changing the learning rate and other hyperparameters but to no avail. In comparison, I have hardcoded a fully-connected feed-forward neural net that reaches around 80-85% accuracy on the same problem.

Here is my current model:

import tensorflow as tf
from tensorflow.keras.layers import Dense  # public Keras import path
import numpy as np
import pandas as pd

def normalize(array):
    return array/np.linalg.norm(array, ord=2, axis=1, keepdims=True)

# Read the file once, then split into train (first 1800 rows) and test (the rest).
data = pd.read_csv('data/labeled.csv', sep=r'\s+').values
x_train, y_train = data[:1800, :-1], data[:1800, -1:]
x_test, y_test = data[1800:, :-1], data[1800:, -1:]

x_train = normalize(x_train)
x_test = normalize(x_test)

model = tf.keras.Sequential()
model.add(Dense(9, input_dim=13, activation=tf.nn.sigmoid))
model.add(Dense(6, activation=tf.nn.sigmoid))
model.add(Dense(1, activation=tf.nn.sigmoid))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=50)
model.evaluate(x_test, y_test)

As mentioned, some of the epochs start with a higher accuracy than they finish with, leading to unstable learning.

  32/1800 [..............................] - ETA: 0s - loss: 0.6830 - acc: 0.5938
1152/1800 [==================>...........] - ETA: 0s - loss: 0.6175 - acc: 0.6727
1800/1800 [==============================] - 0s 52us/step - loss: 0.6098 - acc: 0.6861
Epoch 54/250

  32/1800 [..............................] - ETA: 0s - loss: 0.5195 - acc: 0.8125
1376/1800 [=====================>........] - ETA: 0s - loss: 0.6224 - acc: 0.6672
1800/1800 [==============================] - 0s 43us/step - loss: 0.6091 - acc: 0.6850
Epoch 55/250

What could be the cause of these oscillations in learning in such a simple model? Thanks

EDIT:

I have followed some suggestions from the comments and have modified the model accordingly. It now looks more like this:

model = tf.keras.Sequential()
model.add(Dense(250, input_dim=13, activation=tf.nn.relu))
model.add(Dropout(0.4))
model.add(Dense(200, activation=tf.nn.relu))
model.add(Dropout(0.4))
model.add(Dense(100, activation=tf.nn.relu))
model.add(Dropout(0.3))
model.add(Dense(50, activation=tf.nn.relu))
model.add(Dense(1, activation=tf.nn.sigmoid))

model.compile(optimizer='adadelta',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Can you link to the dataset so the problem can be reproduced with the same data and model? — theberzi, Apr 28 '19 at 20:31

Szymon Maszke · Answer 1 · 2019-04-29T08:05:17.477

Oscillations

Those are most definitely connected to the size of your network; each batch coming through changes your neural network considerably as it does not have enough neurons to represent the relationships.

It works fine for one batch, then updates the weights for the next one and changes previously learned connections, effectively "unlearning" them. That's why the loss is also jumpy, as the network tries to adapt to the task you have given it.

Sigmoid activation and its saturation may be causing you trouble as well (the gradient is non-negligible only in a small region around zero, so most gradient updates are close to zero). Quick fix: use ReLU activation as described below.

Additionally, a neural network does not care about accuracy, only about minimizing the loss value (which it tries to do most of the time). Say it predicts probabilities [0.55, 0.55, 0.55, 0.55, 0.45] for classes [1, 1, 1, 1, 0], so its accuracy is 100% but it's pretty uncertain. Now, let's say the next update pushes the network to predict probabilities [0.8, 0.8, 0.8, 0.8, 0.55]. In that case the loss would drop, but so would the accuracy, from 100% to 80%.
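The numbers in this example can be checked by hand. The snippet below is a minimal sketch that assumes a 0.5 decision threshold and mean binary cross-entropy as the loss; the helper names `binary_cross_entropy` and `accuracy` are made up for illustration and are not Keras functions.

```python
import math

def binary_cross_entropy(labels, probs):
    """Mean binary cross-entropy, where probs are predicted P(class == 1)."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(pred == y) for pred, y in zip(preds, labels)) / len(labels)

labels = [1, 1, 1, 1, 0]
before = [0.55, 0.55, 0.55, 0.55, 0.45]  # all correct, but uncertain
after = [0.8, 0.8, 0.8, 0.8, 0.55]       # more confident, last sample now wrong

loss_before, acc_before = binary_cross_entropy(labels, before), accuracy(labels, before)
loss_after, acc_after = binary_cross_entropy(labels, after), accuracy(labels, after)

print(f"before: loss={loss_before:.3f} acc={acc_before:.2f}")
print(f"after:  loss={loss_after:.3f} acc={acc_after:.2f}")
```

The loss goes down (roughly 0.60 to 0.34) while the accuracy goes from 1.0 to 0.8, which is why the two curves do not have to move together.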

BTW. You may want to check the scores for logistic regression and see how it performs on this task (so a single layer with output only).

Some things to consider

1. Size of your neural network

It's always good to start with a simple model and grow it if needed (I wouldn't advise the other way around). You may want to check on a really small subsample of data (say two or three batches, 160 elements or so) whether your model can learn the relationship between input and output.

In your case I doubt the model will be able to learn those relationships with the size of layers you are providing. Try increasing the size, especially in the earlier layers (maybe 50/100 for starters) and see how it behaves.

2. Activation function

Sigmoid easily saturates (there is only a small region where changes occur, and most of the output values are almost 0 or 1). It is rarely used nowadays as an activation before the bottleneck (final layer). The most common choice nowadays is ReLU, which is not prone to saturation (at least when the input is positive), or one of its variations. This might help as well.

3. Learning rate

For each dataset and each neural network model, the optimal learning rate is different. Defaults usually work so-so, but when the learning rate is too small the network might get stuck in a local minimum (and its generalization will be worse), while a value that is too big will make your network unstable (the loss will oscillate heavily).

You may want to read up on Cyclical Learning Rate (or the original research paper by Leslie N. Smith). In there you can find info on how to choose a good learning rate heuristically and set up some simple learning rate schedulers. Those techniques were used by fast.ai teams in CIFAR10 competitions with really good results. On their site or in the documentation of their library you can find the One Cycle Policy and the learning rate finder (based on the work of the aforementioned researcher). This should get you started in this realm, I think.
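To give a feel for what a cyclical schedule looks like, here is a minimal sketch of the "triangular" policy from Smith's paper. The function name and the default values for `base_lr`, `max_lr` and `step_size` are placeholders chosen for illustration, not recommendations for this dataset.

```python
import math

def triangular_lr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=100):
    """Triangular cyclical learning rate.

    The rate climbs linearly from base_lr to max_lr over step_size
    iterations, then falls back to base_lr over the next step_size.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# One full cycle: bottom, halfway up, peak, back to the bottom.
for it in (0, 50, 100, 200):
    print(it, triangular_lr(it))
```

In Keras, a function like this could be plugged into `tf.keras.callbacks.LearningRateScheduler`, with the caveat that this callback calls the schedule once per epoch, so `iteration` would then count epochs rather than batches.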

4. Normalization

Not sure, but this normalization looks pretty non-standard to me (never seen it done like that): it scales each sample (row) to unit L2 norm instead of scaling each feature (column). Good normalization is the basis for neural network convergence (unless the data is already pretty close to a normal distribution). Usually one subtracts the mean and divides by the standard deviation for each feature. You can check some schemes in the scikit-learn library, for example.
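As a rough sketch of the usual per-feature standardization, the helper below computes the mean and standard deviation on the training set only and reuses them for the test set. The function name `standardize`, the `eps` guard against constant columns and the random stand-in data are all assumptions made for illustration.

```python
import numpy as np

def standardize(train, test, eps=1e-8):
    """Standardize each feature (column) using training-set statistics only."""
    mean = train.mean(axis=0, keepdims=True)
    std = train.std(axis=0, keepdims=True)
    # eps avoids division by zero for features that are constant in the training set
    return (train - mean) / (std + eps), (test - mean) / (std + eps)

# Random stand-in data with the same shapes as in the question (1800/200 rows, 13 features).
rng = np.random.default_rng(0)
x_train = rng.normal(loc=50.0, scale=20.0, size=(1800, 13))
x_test = rng.normal(loc=50.0, scale=20.0, size=(200, 13))

x_train_std, x_test_std = standardize(x_train, x_test)
print(x_train_std.mean(axis=0).round(3))  # close to 0 for every feature
print(x_train_std.std(axis=0).round(3))   # close to 1 for every feature
```

scikit-learn's `StandardScaler` does the same thing with `fit` on the training data and `transform` on both sets.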

5. Depth

This shouldn't be an issue but if your input is complicated you should consider adding more layers to your neural network (right now it's almost definitely too thin). This would allow it to learn more abstract features and transform the input space more.

Overfitting

When the network overfits to the data you may employ some regularization techniques (hard to tell what might help, you should test it on your own), some of those include:

Higher learning rate with batch normalization smoothing out learning space.
Smaller number of neurons (relationships learned by the network would intuitively have to be more data distribution representative).
Smaller batch sizes have a regularization effect as well.
Dropout, though it's hard to pinpoint a good dropout rate. I would resort to it last. Furthermore, it is known to collide with batch normalization (though there are techniques to combine them, you may find more on the web).
L1/L2 regularization, with the second being much more widely applied (unless you have specific knowledge indicating L1 might perform better).
Data augmentation - I would try this one first, mostly out of curiosity. As your features are continuous, you may want to add some random noise on a batch-to-batch basis, generated from a Gaussian distribution. The noise would have to be small, with a standard deviation around 1e-2 or 1e-3; you would have to test those values experimentally.
Early stopping - after N epochs without improvement on the validation set you end your training. Pretty common technique, should be used almost every time. Remember to save the best model on validation set and set patience (N mentioned above) to some moderately sized value (do not set patience to 1 epoch or so, neural network may easily improve after 5 or so).
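To illustrate the patience rule from the early stopping item, here is a small framework-free sketch. The function `early_stopping_epoch` and the validation losses are invented for the example; in Keras the same behaviour is available through `tf.keras.callbacks.EarlyStopping` with its `patience` and `restore_best_weights` arguments.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops at the first epoch that is `patience` epochs past the
    best validation loss seen so far; the best epoch is the one to restore.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: small bumps, then a long stretch without improvement.
val_losses = [0.70, 0.62, 0.58, 0.59, 0.57, 0.60, 0.61, 0.60, 0.62, 0.63, 0.55]

print(early_stopping_epoch(val_losses, patience=5))  # (9, 4)
print(early_stopping_epoch(val_losses, patience=1))  # (3, 2)
```

With a patience of 1 the run stops at epoch 3 and misses the better loss at epoch 4, which is the reason for not setting the patience too low.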

Plus there are tons of other techniques you may find. Check what makes intuitive sense and which one you like the most and test how it performs.
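For the noise-based data augmentation from the list above, a minimal sketch could look like the following. It assumes the features have already been standardized (otherwise a standard deviation of 1e-2 means something different for each feature); the function name `add_gaussian_noise` and the random batch are placeholders for illustration.

```python
import numpy as np

def add_gaussian_noise(batch, std=1e-2, rng=None):
    """Return a copy of the batch with zero-mean Gaussian noise added to every feature."""
    rng = np.random.default_rng() if rng is None else rng
    return batch + rng.normal(loc=0.0, scale=std, size=batch.shape)

# Stand-in for one standardized training batch: 32 samples, 13 features.
rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 13))
noisy = add_gaussian_noise(batch, std=1e-2, rng=rng)

print(np.abs(noisy - batch).max())  # small compared to the unit feature scale
```

Fresh noise should be drawn for every batch and applied only during training, never at evaluation time. Keras also ships a `tf.keras.layers.GaussianNoise` layer that is active only in training mode.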

Hey Szymon, I have followed most of your suggestions and have added a new version of the model if you are interested in having a look. The behavior of the loss is much more stable and the accuracy on the testing set reaches 80-85%. I still have to read through and implement the dynamic learning rate, which I will do soon. I was just wondering if you had any last suggestions to combat overfitting? The accuracy on the training set tends to 1 but the accuracy on the testing set tops out at 85%, even with the dropout layers. Adding neurons and layers only seems to worsen the problem. Thanks again! — Mustfled, Apr 29 '19 at 03:35
@ÉricPfleiderer you could try examining the data to remove outliers. Other things you could try are to minimally reduce the size of the training set compared to the test set, reduce the dropout rate slightly, and see if "early stopping" helps your model. — theberzi, Apr 29 '19 at 06:16
@ÉricPfleiderer added appropriate section. Suggestion by Federico S with outliers is also a viable option (and the one with early stopping too). On the other hand I would argue against reducing dropout rate (this would drive you more towards overfitting regime IIUC). Reducing train set (except for methods like bagging) might do more harm than good as each training sample is precious to the network. — Szymon Maszke, Apr 29 '19 at 08:10

score 3 · Answer 2 · answered Nov 10 '20 at 19:58

I once trained a Siamese network where I noticed that with higher learning rates the training loss went down smoothly (as expected, since that is what the neural network is optimizing), but the validation loss showed huge ups and downs.

This had never happened when I was using a lower learning rate (on the order of 1e-05). I believe the training loss alone is misleading here: recent work has shown that large neural networks (that is, networks with more capacity) can fit random data flawlessly on the training set while performing much worse on validation data. I have linked the paper below for reference; it explains this phenomenon, which is related to overfitting. So one can't judge the model's overall performance just by observing the training data.

Other parameters mentioned above also matter, but I guess one should start by tweaking the learning rate in such a case before tweaking the model itself.

Link for the paper : https://arxiv.org/pdf/1611.03530

Please correct me if I am wrong...

score 0 · Answer 3 · answered Apr 29 '19 at 00:15

All of Szymon's points are great, but another possible cause: are you shuffling your dataset? If not, and your data contains some ordered bias, your model may be tuning itself to one 'end' of the dataset, only to do poorly at the other 'end'.
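One place where ordering can matter in the question's code is the train/test split itself, which takes the first 1800 rows for training and the rest for testing. Below is a minimal sketch of shuffling before splitting; the helper `shuffled_split`, the fixed seed and the toy data are assumptions made for the example.

```python
import numpy as np

def shuffled_split(features, labels, n_train, seed=0):
    """Shuffle rows with one shared permutation, then split into train and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return features[train_idx], labels[train_idx], features[test_idx], labels[test_idx]

# Toy data sorted by label, which is the worst case for a positional split.
features = np.arange(20, dtype=float).reshape(10, 2)
labels = np.array([0] * 5 + [1] * 5).reshape(-1, 1)

x_train, y_train, x_test, y_test = shuffled_split(features, labels, n_train=8)
print(y_train.ravel(), y_test.ravel())
```

Using the same permutation for features and labels keeps every row paired with its own label.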

DomJack


I think so. According to the tensorflow documentation, the fit() method will shuffle the training set every epoch by default. – Mustfled Apr 29 '19 at 03:09