
I am following the book Grokking Deep Learning (Ch. 8, code here) to build a NumPy neural network that classifies MNIST digits with ~82% test accuracy. But when I modify the network to work on a synthetic dataset, the train accuracy jumps to a particular value (which depends on the hidden-layer size and alpha) right from the start of training and stays there. Please check:

import numpy as np
import sys
from sklearn import datasets

X, y = datasets.make_classification(n_samples=10000, n_features=5, n_classes=4, 
                                        n_clusters_per_class=1, shuffle=True, random_state=1)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

def relu(x):
    return (x >= 0) * x # returns x if x > 0
                        # returns 0 otherwise

def relu2deriv(output):
    return output >= 0 # returns 1 where output >= 0, 0 otherwise

def onehot(arr):
    one_hot_labels = np.zeros((len(arr),4))
    for i,l in enumerate(arr):
        one_hot_labels[i][l] = 1
        
    return one_hot_labels

y_train = onehot(y_train)
y_test = onehot(y_test)

alpha, iterations, hidden_size = (0.002, 300, 10)

weights_0_1 = 0.2*np.random.random((5, hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size, 4)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0,0)
    for i in range(len(X_train)):
        layer_0 = X_train[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = np.dot(layer_1,weights_1_2)

        error += np.sum((y_train[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(y_train[i:i+1]))
        layer_2_delta = (y_train[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        layer_1_delta *= dropout_mask

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if(j%1 == 0):  # can be set for any interval
        test_error = 0.0
        test_correct_cnt = 0

        for i in range(len(X_test)):
            layer_0 = X_test[i:i+1]
            layer_1 = relu(np.dot(layer_0, weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)

            test_error += np.sum((y_test[i:i+1] - layer_2) ** 2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(y_test[i:i+1]))
        
        sys.stdout.write("\n" + \
                     "I:" + str(j) + \
                     " Test-Err:" + str(test_error/ float(len(X_test)))[0:5] +\
                     " Test-Acc:" + str(test_correct_cnt/ float(len(X_test)))+\
                     " Train-Err:" + str(error/ float(len(X_train)))[0:5] +\
                     " Train-Acc:" + str(correct_cnt/ float(len(X_train))))

Output:

I:0 Test-Err:0.470 Test-Acc:0.812 Train-Err:0.704 Train-Acc:0.572
I:1 Test-Err:0.452 Test-Acc:0.811 Train-Err:0.574 Train-Acc:0.626625
I:2 Test-Err:0.445 Test-Acc:0.814 Train-Err:0.571 Train-Acc:0.61425
    .
    .
    .
I:297 Test-Err:0.470 Test-Acc:0.7685 Train-Err:0.613 Train-Acc:0.6045
I:298 Test-Err:0.492 Test-Acc:0.785 Train-Err:0.612 Train-Acc:0.60525
I:299 Test-Err:0.478 Test-Acc:0.778 Train-Err:0.614 Train-Acc:0.60725

What's going on? How can this NN perform well on MNIST but not on this dataset?

  • What's the synthetic dataset? – Gulzar Apr 04 '21 at 10:08
  • The synthetic dataset is the classification dataset I created using the `sklearn.datasets.make_classification()` method. I have edited the question to include the import statement for `sklearn.datasets`. – Anirban Apr 04 '21 at 10:13

1 Answer


I believe the problem lies in the fact that you have no bias terms.

For example,

layer_1 = relu(np.dot(layer_0,weights_0_1))

Geometrically, that means the output of layer 1 (and of the subsequent layers) has no translation term, so the decision boundary is forced to pass through the origin.

See visualization

Thus it may be impossible to learn a decision boundary for data that is not centered around 0.
Think of binary-classification data that is closely clustered around (0, 1) and (0, 2): no linear boundary that passes through (0, 0) can separate those clusters.
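As a quick illustration (my own, not part of the original post), a linear classifier fit without an intercept on exactly that kind of data cannot do better than chance, while the same model with an intercept separates the clusters easily:

# Illustration only: a linear boundary forced through the origin cannot separate
# two clusters centred at (0, 1) and (0, 2).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
a = rng.normal(loc=(0, 1), scale=0.05, size=(100, 2))  # cluster around (0, 1)
b = rng.normal(loc=(0, 2), scale=0.05, size=(100, 2))  # cluster around (0, 2)
X = np.vstack([a, b])
y = np.array([0] * 100 + [1] * 100)

no_intercept = LinearSVC(fit_intercept=False).fit(X, y)
with_intercept = LinearSVC(fit_intercept=True).fit(X, y)
print(no_intercept.score(X, y))    # ~0.5: the boundary must pass through the origin
print(with_intercept.score(X, y))  # ~1.0: the boundary can be translated between the clusters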

See here for a nice explanation of why a bias is required.


I believe (though I did not check) that adding bias terms should allow the network to converge.

layer_1 = relu(np.dot(layer_0,weights_0_1) + layer_0_bias) 

and so on.

The derivative with respect to the bias is discussed here.
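For completeness, here is a minimal, untested sketch of how the bias terms could be wired into the question's training loop (the names bias_0_1 and bias_1_2 are mine; dropout is omitted for brevity, and the sign convention matches the question's `+= alpha * ...` updates):

# Initialised once, before the training loop:
bias_0_1 = np.zeros((1, hidden_size))
bias_1_2 = np.zeros((1, 4))

# Inside the loop over training examples:
layer_0 = X_train[i:i+1]
layer_1 = relu(np.dot(layer_0, weights_0_1) + bias_0_1)
layer_2 = np.dot(layer_1, weights_1_2) + bias_1_2

layer_2_delta = (y_train[i:i+1] - layer_2)
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
bias_1_2 += alpha * layer_2_delta.sum(axis=0, keepdims=True)  # bias gradient = delta summed over the batch
weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
bias_0_1 += alpha * layer_1_delta.sum(axis=0, keepdims=True)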


There are other possible causes:

  1. layer_2's raw output is used directly as the prediction and an MSE loss is computed on it, instead of applying softmax and using an NLL/cross-entropy loss, which is the usual choice for multi-class classification (see the sketch after this list).
  2. The inputs are not normalized, which can keep the net from learning. This is unlikely to matter for this synthetic data, which comes from a hypercube, but it does matter for other, more general data.
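A rough, untested sketch of both changes (standardising the inputs with training-set statistics, and replacing the linear output + MSE with softmax + cross-entropy; with softmax, the gradient of the cross-entropy loss with respect to the pre-softmax output is simply prediction minus target, so the question's `y - layer_2` delta keeps working with the `+=` updates):

# Standardise inputs using statistics computed on the training set only:
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))  # subtract the max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

# Inside the training loop, replace the MSE pieces with:
layer_2 = softmax(np.dot(layer_1, weights_1_2))
error += -np.sum(y_train[i:i+1] * np.log(layer_2 + 1e-12))  # cross-entropy loss
layer_2_delta = (y_train[i:i+1] - layer_2)  # CE-with-softmax gradient, same sign convention as the question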
  • Thank you for pointing out the bias term. I tried to add the bias with the following lines: `b = np.zeros(10).reshape(1, 10)` and in the calculation: `layer_1 = relu(np.dot(layer_0,weights_0_1) + b)` . . . `b_delta = np.sum(layer_1_delta, axis = 0)` `b += -alpha*b_delta`. The output is still strange: I:0 Test-Err:0.471 Test-Acc:0.8 Train-Err:0.780 Train-Acc:0.541625 I:1 Test-Err:0.476 Test-Acc:0.808 Train-Err:0.668 Train-Acc:0.631375 I:2 Test-Err:0.465 Test-Acc:0.8105 Train-Err:0.674 Train-Acc:0.65175 and the error does not converge. – Anirban Apr 05 '21 at 04:25
  • @anirban please post a new question with the updated code. – Gulzar Apr 05 '21 at 12:39