
According to this website, a deep belief network is just multiple RBMs stacked together, using the output of the previous RBM as the input to the next RBM.
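To make sure I understand the stacking part, here is a minimal sketch of my own (not from that page) of feeding the hidden output of one RBM into the next; X is assumed to be the 0-1 scaled input from the code further down:

from sklearn.neural_network import BernoulliRBM

rbm1 = BernoulliRBM(n_components=100, random_state=101)
rbm2 = BernoulliRBM(n_components=80, random_state=101)

H1 = rbm1.fit_transform(X)   # hidden activations of the first RBM
H2 = rbm2.fit_transform(H1)  # the second RBM is trained on the first RBM's output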

In the scikit-learn documentation, there is an example of using an RBM to classify the handwritten digits dataset. It puts a BernoulliRBM and a LogisticRegression in a pipeline to achieve better accuracy.

Therefore, I wonder if I can add multiple RBMs to that pipeline to create a deep belief network, as shown in the following code.

from sklearn.neural_network import BernoulliRBM
import numpy as np
from sklearn import linear_model, datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

digits = datasets.load_digits()
X = np.asarray(digits.data, 'float32')
Y = digits.target
X = (X - np.min(X, 0)) / (np.max(X, 0) + 0.0001)  # 0-1 scaling

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.2,
                                                    random_state=0)

logistic = linear_model.LogisticRegression(C=100)
rbm1 = BernoulliRBM(n_components=100, learning_rate=0.06, n_iter=100, verbose=1, random_state=101)
rbm2 = BernoulliRBM(n_components=80, learning_rate=0.06, n_iter=100, verbose=1, random_state=101)
rbm3 = BernoulliRBM(n_components=60, learning_rate=0.06, n_iter=100, verbose=1, random_state=101)
DBN3 = Pipeline(steps=[('rbm1', rbm1),('rbm2', rbm2), ('rbm3', rbm3), ('logistic', logistic)])

DBN3.fit(X_train, Y_train)

print("Logistic regression using RBM features:\n%s\n" % (
    metrics.classification_report(
        Y_test,
        DBN3.predict(X_test))))

However, I found that the more RBMs I add to the pipeline, the lower the accuracy gets.

1 RBM in pipeline --> 95%

2 RBMs in pipeline --> 93%

3 RBMs in pipeline --> 89%

The training curve below shows that 100 iterations is just about right for convergence. More iterations cause over-fitting, and the likelihood goes down again.

Batch size = 10


Batch size = 256 or above

I have noticed one interesting thing: if I use a larger batch size, the performance of the network deteriorates a lot. When the batch size is above 256, the accuracy drops to less than 10%. The corresponding training curve somehow doesn't make sense to me: the first and second RBMs don't learn much, but then the third RBM suddenly learns quickly.
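For completeness, here is a rough sketch (not the exact code I used) of how such a training curve can be recorded: drive the RBM with partial_fit over mini-batches and log the mean pseudo-likelihood after each pass over X_train.

import matplotlib.pyplot as plt
from sklearn.utils import gen_batches

rbm = BernoulliRBM(n_components=100, learning_rate=0.06, random_state=101)
batch_size = 10
curve = []
for epoch in range(100):
    for batch in gen_batches(X_train.shape[0], batch_size):
        rbm.partial_fit(X_train[batch])              # one contrastive-divergence update per mini-batch
    curve.append(rbm.score_samples(X_train).mean())  # mean pseudo-likelihood after this pass

plt.plot(curve)
plt.xlabel('epoch')
plt.ylabel('mean pseudo-likelihood')
plt.show()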

It looks like 89% is somehow a ceiling for a network with 3 RBMs.

I wonder if I am doing anything wrong here. Is my understanding of deep belief networks correct?

Raven Cheuk
  • Be mindful that the more RBMs you stack together, the more parameters have to be estimated. 100 iterations may not be enough. Have you checked whether the models converge or not? Have you checked the validation loss? It typically should go down for a while, and then at some point overfitting occurs and it starts going up. – Eskapp Sep 04 '18 at 12:38
  • Stacks of RBMs are trained in a greedy fashion, i.e. you first fully train the lowest layer, then collect some sample encodings to train the next layer with, then you train the next layer, etc. I am not familiar with `sklearn.pipeline`, however, given that it seems to be a general purpose tool for optimising chains of models/transformations, I would assume that it tries to train/fit all models synchronously rather than sequentially. – Paul Brodersen Sep 04 '18 at 12:49
  • Another, more subtle problem in your stack of RBMs is that all layers have the same number of units and hence you are not forcing successive layers to generalize successively more. In effect, you are not "distributing" (for lack of a better word) the learning process across layers, as once the transformation of the first layer is learnt, all successive layers can simply learn the identity transform (and quite likely will, as it is the easiest transform to learn). All you are doing in the 2nd and 3rd layer at the moment is adding noise (due to the nonlinearity). – Paul Brodersen Sep 04 '18 at 12:57
  • Make your layers successively smaller. A factor of 1/2 works well for MNIST. Don't compress to more than 30-50 units. – Paul Brodersen Sep 04 '18 at 12:57
  • Are the loss backpropagated across layers? – alvas Sep 04 '18 at 12:58
  • @alvas This is a stack of RBMs, not backprop. RBMs are trained by some variant of contrastive divergence, which operates on a pair of layers at a time. So no, gradients are not propagated here and that is the expected (and desired) behaviour. – Paul Brodersen Sep 04 '18 at 13:02
  • @Paul Brodersen I have tried reducing the number of units for successive layers, but the accuracy is more or less the same. – Raven Cheuk Sep 04 '18 at 13:05
  • @Eskapp I have checked that the model has converged. More iterations cause over-fitting (the validation accuracy decreases), fewer iterations don't give the best accuracy. 100 iterations is so far the best result I can get. – Raven Cheuk Sep 04 '18 at 13:05
  • Ok, I now read the relevant parts of the sklearn documentation and source code. `Pipeline` does the right thing inasmuch as it [trains the layers successively](https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/pipeline.py#L200). The defaults for `BernoulliRBM` are not great, IMO. `batch_size=10` is way too small -- you want a representative sample, so 512 or 1024 samples would be better. Also, I would run at least for 10 epochs, i.e. 500k samples total. – Paul Brodersen Sep 04 '18 at 13:26
  • Regarding the network architecture, I would go for 400-200-100 (maybe even adding a final layer with 50 units). There is not a whole lot left to learn after the first layer in a 100-80-60 network. – Paul Brodersen Sep 04 '18 at 13:26
  • @Paul Brodersen There is no epochs parameter for the `BernoulliRBM`. Do I need to use a for loop to fit the model 10 times? – Raven Cheuk Sep 04 '18 at 13:49
  • @Paul Brodersen Strangely, a larger batch size like 512 yields an extremely poor result, with an accuracy of only 4%. Only when the batch size is under 30 is the accuracy in the range of 79%~89%. It seems 89% accuracy is the best result I can achieve with 3 RBMs. – Raven Cheuk Sep 04 '18 at 14:26
  • Yeah, I have been playing around with the code and there is something extremely fishy with the RBM implementation in sklearn; potentially even a serious bug. For example, when using a single layer RBM with 200 units, setting the learning rate to zero actually improves the precision/recall compared to using a learning rate that results in a substantial increase of the log likelihood. If I were you, I would take a step back, make some tools to plot the RBM reconstructions and receptive fields, and then simply try to train a single layer RBM to do the right thing. Then expand from there. – Paul Brodersen Sep 04 '18 at 15:04
  • @Paul Brodersen I know how to plot the RBM reconstructions. But as for the receptive fields part, this is the first time I have heard the term. I wonder if it means plotting each component learnt by the RBM? – Raven Cheuk Sep 05 '18 at 03:06
  • Yeah, that is correct. In analogy to biological neurons, these components are sometimes called receptive fields. – Paul Brodersen Sep 05 '18 at 08:57
  • Also, I tentatively retract my statement about the RBM implementation being bugged. After having read [this paper](http://www.jmlr.org/papers/v11/erhan10a.html), I think that a) the layer sizes are too small, and b) the number of training samples is way, way too low. Have a look at Fig. 9 and 11. – Paul Brodersen Sep 05 '18 at 10:14
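As a side note, here is a minimal sketch of the receptive-field plot discussed in the last few comments, assuming `rbm1` has already been fitted (e.g. via `DBN3.fit` in the code above):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(10, 10, figsize=(8, 8))
for component, ax in zip(rbm1.components_, axes.ravel()):
    ax.imshow(component.reshape(8, 8), cmap='gray')  # each component is one 8x8 "receptive field"
    ax.set_xticks(())
    ax.set_yticks(())
plt.show()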

1 Answer


The following is not quite a definitive answer, as it lacks statistical rigour; the necessary parameter optimization and evaluation would take several more days of CPU time. Until then, I submit the following proof of principle as an answer.

Tl;dr

Larger layers + much longer training => logistic regression by itself < logistic regression + 1 RBM layer < logistic regression + RBM stack / DBN (in terms of classification performance)

Introduction

As I have stated in one of my comments to OP's post, the use of stacked RBMs / DBNs for unsupervised pre-training has been systematically explored in Erhan et al. (2010). To be precise, their setup differs from OP's setup in so far as after training the DBN, they add a final layer of output neurons and fine-tune the complete network using backprop. OP evaluates the benefit of adding one or more RBM layers using the performance of logistic regression on the output of the final layer. Furthermore, Erhan et al. also don't use the 64 pixel digits data set in scikit-learn but the 784 pixel MNIST images (and variants thereof).

That being said, the similarities are substantial enough to take their findings as the starting point for the evaluation of a scikit-learn implementation of a DBN, which is precisely what I have done: I also use the MNIST data set, and I use the optimal parameters (where reported) from Erhan et al. These parameters differ substantially from the ones given in OP's example and are likely the source of the poor performance of OP's model: in particular, the layer sizes are much larger and the number of training samples is orders of magnitude larger. However, like OP, I use logistic regression as the final step of the pipeline to evaluate whether the image transformations by an RBM or by a stack of RBMs / a DBN improve classification.

Incidentally, having (roughly) as many units in the RBM layers (800 units) as pixels in the original image (784 pixels) also makes pure logistic regression on the raw image pixels a suitable benchmark model.

I hence compare the following 3 models:

  1. logistic regression by itself (i.e. the baseline / benchmark model),

  2. logistic regression on outputs of an RBM, and

  3. logistic regression on outputs of a stack of RBMs / a DBN.

Results

Consistent with the previous literature, my preliminary results indeed indicate that using the output of an RBM for logistic regression improves performance compared to using the raw pixel values by themselves, and the DBN transformation improves on the single RBM further, although the improvement is smaller.

Logistic regression by itself:

Model performance:
             precision    recall  f1-score   support

        0.0       0.95      0.97      0.96       995
        1.0       0.96      0.98      0.97      1121
        2.0       0.91      0.90      0.90      1015
        3.0       0.90      0.89      0.89      1033
        4.0       0.93      0.92      0.92       976
        5.0       0.90      0.88      0.89       884
        6.0       0.94      0.94      0.94       999
        7.0       0.92      0.93      0.93      1034
        8.0       0.89      0.87      0.88       923
        9.0       0.89      0.90      0.89      1020

avg / total       0.92      0.92      0.92     10000

Logistic regression on outputs of an RBM:

Model performance:
             precision    recall  f1-score   support

        0.0       0.98      0.98      0.98       995
        1.0       0.98      0.99      0.99      1121
        2.0       0.95      0.97      0.96      1015
        3.0       0.97      0.96      0.96      1033
        4.0       0.98      0.97      0.97       976
        5.0       0.97      0.96      0.96       884
        6.0       0.98      0.98      0.98       999
        7.0       0.96      0.97      0.97      1034
        8.0       0.96      0.94      0.95       923
        9.0       0.96      0.96      0.96      1020

avg / total       0.97      0.97      0.97     10000

Logistic regression on outputs of a stack of RBMs / a DBN:

Model performance:
             precision    recall  f1-score   support

        0.0       0.99      0.99      0.99       995
        1.0       0.99      0.99      0.99      1121
        2.0       0.97      0.98      0.98      1015
        3.0       0.98      0.97      0.97      1033
        4.0       0.98      0.97      0.98       976
        5.0       0.96      0.97      0.97       884
        6.0       0.99      0.98      0.98       999
        7.0       0.98      0.98      0.98      1034
        8.0       0.98      0.97      0.97       923
        9.0       0.96      0.97      0.96      1020

avg / total       0.98      0.98      0.98     10000

Code

#!/usr/bin/env python

"""
Using MNIST, compare classification performance of:
1) logistic regression by itself,
2) logistic regression on outputs of an RBM, and
3) logistic regression on outputs of a stack of RBMs / a DBN.
"""

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report


def norm(arr):
    arr = arr.astype(np.float64)  # scale to 0-1 range (np.float is deprecated in newer numpy)
    arr -= arr.min()
    arr /= arr.max()
    return arr


if __name__ == '__main__':

    # load MNIST data set
    # (in newer scikit-learn versions, fetch_mldata has been removed;
    #  fetch_openml('mnist_784') is the usual replacement)
    mnist = fetch_mldata('MNIST original')
    X, Y = mnist.data, mnist.target

    # normalize inputs to 0-1 range
    X = norm(X)

    # split into train, validation, and test data sets
    X_train, X_test, Y_train, Y_test = train_test_split(X,       Y,       test_size=10000, random_state=0)
    X_train, X_val,  Y_train, Y_val  = train_test_split(X_train, Y_train, test_size=10000, random_state=0)

    # --------------------------------------------------------------------------------
    # set hyperparameters

    learning_rate = 0.02 # from Erhan et al. (2010): median value in grid-search
    total_units   =  800 # from Erhan et al. (2010): optimal for MNIST / only slightly worse than 1200 units when using InfiniteMNIST
    total_epochs  =   50 # from Erhan et al. (2010): optimal for MNIST
    batch_size    =  128 # seems like a representative sample; backprop literature often uses 256 or 512 samples

    C = 100. # optimum for benchmark model according to sklearn docs: https://scikit-learn.org/stable/auto_examples/neural_networks/plot_rbm_logistic_classification.html#sphx-glr-auto-examples-neural-networks-plot-rbm-logistic-classification-py

    # TODO optimize using grid search, etc

    # --------------------------------------------------------------------------------
    # construct models

    # RBM
    rbm = BernoulliRBM(n_components=total_units, learning_rate=learning_rate, batch_size=batch_size, n_iter=total_epochs, verbose=1)

    # "output layer"
    logistic = LogisticRegression(C=C, solver='lbfgs', multi_class='multinomial', max_iter=200, verbose=1)

    models = []
    models.append(Pipeline(steps=[('logistic', clone(logistic))]))                                              # base model / benchmark
    models.append(Pipeline(steps=[('rbm1', clone(rbm)), ('logistic', clone(logistic))]))                        # single RBM
    models.append(Pipeline(steps=[('rbm1', clone(rbm)), ('rbm2', clone(rbm)), ('logistic', clone(logistic))]))  # RBM stack / DBN

    # --------------------------------------------------------------------------------
    # train and evaluate models

    for model in models:
        # train
        model.fit(X_train, Y_train)

        # evaluate using validation set
        print("Model performance:\n%s\n" % (
            classification_report(Y_val, model.predict(X_val))))

    # TODO: after parameter optimization, evaluate on test set
Paul Brodersen
  • Why did you use two RBMs with the same size? – Ahmad Nov 27 '18 at 22:57
  • @Ahmad Erhan et al. keep the size of the layers constant (for any given experiment). As I am using their hyperparameters, I followed their example in this as well. However, from first principles, having successively smaller layers should be better as you force greater generalization. – Paul Brodersen Nov 28 '18 at 10:26
  • @Ahmad Additionally, a constant layer size of 800 has the advantage that we are not changing the input size to the logistic regression. Thus we keep the comparability between the benchmark (pure logistic regression) and the setups with 1 or 2 RBM layers. If the layers were successively smaller, the logistic regression could have an increasingly easier job, which might thus skew the results (independent of whether having additional RBM layers improves performance by itself). – Paul Brodersen Nov 28 '18 at 10:33
  • Thank you very much! Some comments mentioned that using scikit-learn's `Pipeline` isn't a correct way to implement a DBN. Are you sure that this implementation is the correct method for building a DBN? Because I have never seen such succinct code for a DBN. – Ahmad Nov 28 '18 at 15:59
  • @Ahmad If you look closely at that comment about `Pipeline`, you will notice that it was my own -- at the time uninformed -- opinion. I then went back and looked at the source code of `Pipeline`, which shows that it is indeed doing the right thing, and I retracted my earlier statement. So yeah, this stack of RBMs is trained in the correct fashion to the best of my knowledge, and the improvement in performance when using one or two RBM layers is strong evidence in favour of the correctness as well. – Paul Brodersen Nov 28 '18 at 16:09
  • @Ahmad Similarly, my later comments about `scikit-learn`'s `BernoulliRBM` being bugged were also wrong. I was just in a very bad place in parameter space and hence I got very unexpected (and bad) results. In summary, `scikit-learn` works just fine; all of the problems I raised in my comments to OP's post were just user errors. – Paul Brodersen Nov 28 '18 at 16:14
  • So, to make the DBN work, do we just need more samples (from 1797 to 10,000) and more features (64 pixels to 784 pixels)? – Raven Cheuk Nov 29 '18 at 03:42
  • @RavenCheuk In my (and Erhan et al.'s) setup, the RBM layers are trained on 50 000 samples for 50 epochs / repetitions. I think this strongly contributes to the increase in performance. Having more features is just a by-product of using MNIST instead of the digits data set. I would be surprised if that made learning easier. I am reluctant to say how changes in the other parameters (batch size, training rate) may have contributed without further testing. In the meantime, I can warmly recommend the Erhan et al. paper. It's a good paper and easy read. – Paul Brodersen Nov 29 '18 at 13:03