The following is not a definitive answer, as it lacks statistical rigor.
The necessary parameter optimization and evaluation will take several more days of CPU time; until then, I submit the following proof of principle as an answer.
Tl;dr
Larger layers + much longer training => performance of logistic regression by itself < + 1 RBM layer < + RBM stack / DBN
Introduction
As I stated in one of my comments on OP's post, the use of stacked RBMs / DBNs for unsupervised pre-training has been systematically explored in Erhan et al. (2010). To be precise, their setup differs from OP's insofar as, after training the DBN, they add a final layer of output neurons and fine-tune the complete network using backprop. OP, in contrast, evaluates the benefit of adding one or more RBM layers using the performance of logistic regression on the output of the final layer.
Furthermore, Erhan et al. do not use the 64-pixel digits data set in scikit-learn but the 784-pixel MNIST images (and variants thereof).
That being said, the similarities are substantial enough to take their findings as the starting point for evaluating a scikit-learn implementation of a DBN, which is precisely what I have done: I also use the MNIST data set, and I use the optimal parameters reported by Erhan et al. (where available). These parameters differ substantially from the ones in OP's example and are likely the source of the poor performance of OP's model: in particular, the layers are much larger and the number of training samples is orders of magnitude greater. However, like OP, I use logistic regression in the final step of the pipeline to evaluate whether the image transformations by an RBM or by a stack of RBMs / a DBN improve classification.
Incidentally, having (roughly) as many units in the RBM layers (800 units) as pixels in the original image (784 pixels) also makes pure logistic regression on the raw image pixels a suitable benchmark model.
I hence compare the following 3 models:
logistic regression by itself (i.e. the baseline / benchmark model),
logistic regression on outputs of an RBM, and
logistic regression on outputs of a stack of RBMs / a DBN.
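To make the third model concrete, here is a minimal sketch of what "stacking" means in scikit-learn terms: the second RBM is trained on the hidden-unit activation probabilities produced by the first. A `Pipeline` does the same chaining internally, since `fit` calls `fit_transform` on every intermediate step in order. The sketch assumes the small 64-pixel digits data set bundled with scikit-learn (not MNIST), and the layer sizes and epoch counts are arbitrary values picked only so that it runs in seconds.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

# Illustration only: the small 8x8 digits set, not MNIST.
X, _ = load_digits(return_X_y=True)
X = X / X.max()  # scale pixel intensities to the 0-1 range

# Layer sizes and epoch counts are arbitrary, chosen for speed.
rbm1 = BernoulliRBM(n_components=32, learning_rate=0.02, n_iter=3, random_state=0)
rbm2 = BernoulliRBM(n_components=16, learning_rate=0.02, n_iter=3, random_state=0)

# Greedy layer-wise training: the first RBM is fit on the pixels,
# the second RBM is fit on the hidden activations of the first.
H1 = rbm1.fit_transform(X)   # P(h = 1 | v) for each hidden unit
H2 = rbm2.fit_transform(H1)

print(H1.shape, H2.shape)
```

Logistic regression is then fit on `H1` (single RBM) or `H2` (stack); no fine-tuning of the RBM weights takes place in either case.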
Results
Consistent with the previous literature, my preliminary results indicate that using the output of an RBM as input to logistic regression improves performance compared to using the raw pixel values, and that the DBN transformation improves on the RBM in turn, although by a smaller margin.
Logistic regression by itself:
Model performance:
             precision     recall   f1-score    support
        0.0       0.95       0.97       0.96        995
        1.0       0.96       0.98       0.97       1121
        2.0       0.91       0.90       0.90       1015
        3.0       0.90       0.89       0.89       1033
        4.0       0.93       0.92       0.92        976
        5.0       0.90       0.88       0.89        884
        6.0       0.94       0.94       0.94        999
        7.0       0.92       0.93       0.93       1034
        8.0       0.89       0.87       0.88        923
        9.0       0.89       0.90       0.89       1020
avg / total       0.92       0.92       0.92      10000
Logistic regression on outputs of an RBM:
Model performance:
             precision     recall   f1-score    support
        0.0       0.98       0.98       0.98        995
        1.0       0.98       0.99       0.99       1121
        2.0       0.95       0.97       0.96       1015
        3.0       0.97       0.96       0.96       1033
        4.0       0.98       0.97       0.97        976
        5.0       0.97       0.96       0.96        884
        6.0       0.98       0.98       0.98        999
        7.0       0.96       0.97       0.97       1034
        8.0       0.96       0.94       0.95        923
        9.0       0.96       0.96       0.96       1020
avg / total       0.97       0.97       0.97      10000
Logistic regression on outputs of a stack of RBMs / a DBN:
Model performance:
             precision     recall   f1-score    support
        0.0       0.99       0.99       0.99        995
        1.0       0.99       0.99       0.99       1121
        2.0       0.97       0.98       0.98       1015
        3.0       0.98       0.97       0.97       1033
        4.0       0.98       0.97       0.98        976
        5.0       0.96       0.97       0.97        884
        6.0       0.99       0.98       0.98        999
        7.0       0.98       0.98       0.98       1034
        8.0       0.98       0.97       0.97        923
        9.0       0.96       0.97       0.96       1020
avg / total       0.98       0.98       0.98      10000
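The gap between the models is easier to judge as error rates than as scores close to 1. The following sketch recomputes an approximate overall accuracy for each model from the per-class recall and support columns reported above, using the fact that the support-weighted mean of the per-class recalls equals the overall accuracy. The inputs are the two-decimal values from the reports, so the results are only accurate to about half a percentage point; the dictionary keys are labels I made up for this illustration.

```python
# Per-class recall and support, copied from the three reports above.
support = [995, 1121, 1015, 1033, 976, 884, 999, 1034, 923, 1020]
recall = {
    "logistic": [0.97, 0.98, 0.90, 0.89, 0.92, 0.88, 0.94, 0.93, 0.87, 0.90],
    "rbm":      [0.98, 0.99, 0.97, 0.96, 0.97, 0.96, 0.98, 0.97, 0.94, 0.96],
    "dbn":      [0.99, 0.99, 0.98, 0.97, 0.97, 0.97, 0.98, 0.98, 0.97, 0.97],
}


def approx_accuracy(recalls, supports):
    """Support-weighted mean recall, which equals the overall accuracy."""
    return sum(r * s for r, s in zip(recalls, supports)) / sum(supports)


accuracy = {name: approx_accuracy(r, support) for name, r in recall.items()}
error = {name: 1.0 - acc for name, acc in accuracy.items()}

for name in recall:
    print("%-9s accuracy ~ %.3f, error ~ %.1f%%"
          % (name, accuracy[name], 100 * error[name]))

# Fraction of the remaining errors removed by each additional step.
print("logistic -> rbm: %.0f%% fewer errors"
      % (100 * (1 - error["rbm"] / error["logistic"])))
print("rbm -> dbn:      %.0f%% fewer errors"
      % (100 * (1 - error["dbn"] / error["rbm"])))
```

With these rounded inputs the error rates come out at roughly 8%, 3% and 2%, so the single RBM removes well over half of the baseline's errors, while the second RBM layer removes a smaller share of what is left.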
Code
#!/usr/bin/env python
"""
Using MNIST, compare classification performance of:
1) logistic regression by itself,
2) logistic regression on outputs of an RBM, and
3) logistic regression on outputs of a stack of RBMs / a DBN.
"""

import numpy as np

# fetch_mldata was removed in scikit-learn 0.22; fetch_openml replaces it
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report


def norm(arr):
    """Rescale an array linearly to the 0-1 range."""
    arr = arr.astype(np.float64)  # the np.float alias was removed in NumPy 1.24
    arr -= arr.min()
    arr /= arr.max()
    return arr


if __name__ == '__main__':

    # load MNIST data set
    X, Y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
    Y = Y.astype(np.float64)  # OpenML returns the labels as strings

    # normalize inputs to 0-1 range
    X = norm(X)

    # split into train, validation, and test data sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=10000, random_state=0)
    X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=10000, random_state=0)

    # --------------------------------------------------------------------------------
    # set hyperparameters

    learning_rate = 0.02  # from Erhan et al. (2010): median value in grid-search
    total_units = 800     # from Erhan et al. (2010): optimal for MNIST / only slightly worse than 1200 units when using InfiniteMNIST
    total_epochs = 50     # from Erhan et al. (2010): optimal for MNIST
    batch_size = 128      # seems like a representative sample; backprop literature often uses 256 or 512 samples

    # optimum for benchmark model according to sklearn docs:
    # https://scikit-learn.org/stable/auto_examples/neural_networks/plot_rbm_logistic_classification.html#sphx-glr-auto-examples-neural-networks-plot-rbm-logistic-classification-py
    C = 100.

    # TODO optimize using grid search, etc

    # --------------------------------------------------------------------------------
    # construct models

    # RBM
    rbm = BernoulliRBM(n_components=total_units, learning_rate=learning_rate, batch_size=batch_size, n_iter=total_epochs, verbose=1)

    # "output layer"
    # (multi_class='multinomial' is deprecated in recent scikit-learn versions;
    # with the lbfgs solver, a multinomial model is the default since 0.22)
    logistic = LogisticRegression(C=C, solver='lbfgs', max_iter=200, verbose=1)

    models = []
    models.append(Pipeline(steps=[('logistic', clone(logistic))]))  # base model / benchmark
    models.append(Pipeline(steps=[('rbm1', clone(rbm)), ('logistic', clone(logistic))]))  # single RBM
    models.append(Pipeline(steps=[('rbm1', clone(rbm)), ('rbm2', clone(rbm)), ('logistic', clone(logistic))]))  # RBM stack / DBN

    # --------------------------------------------------------------------------------
    # train and evaluate models

    for model in models:
        # train
        model.fit(X_train, Y_train)

        # evaluate using validation set
        print("Model performance:\n%s\n" % (
            classification_report(Y_val, model.predict(X_val))))

    # TODO: after parameter optimization, evaluate on test set
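The two TODO items could be approached along the following lines. This is only a sketch under several assumptions, not the optimization I have actually run: it uses the small 64-pixel digits data set so that it finishes quickly, it swaps the fixed validation split used above for 3-fold cross-validation via `GridSearchCV`, and the parameter grid is a deliberately tiny, hypothetical one. Parameters of pipeline steps are addressed as `<step name>__<parameter name>`.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Illustration only: small 8x8 digits set so the search finishes quickly.
X, Y = load_digits(return_X_y=True)
X = X / X.max()  # scale pixel intensities to the 0-1 range
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

pipeline = Pipeline(steps=[
    ('rbm1', BernoulliRBM(n_iter=5, random_state=0)),
    ('logistic', LogisticRegression(max_iter=200)),
])

# Hypothetical, deliberately tiny grid (8 combinations).
param_grid = {
    'rbm1__n_components': [32, 64],
    'rbm1__learning_rate': [0.02, 0.1],
    'logistic__C': [1.0, 100.0],
}

# 3-fold cross-validation on the training data only; the best
# combination is refit on the full training set afterwards.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, Y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy: %.3f" % search.best_score_)

# Only after the search is finished: a single evaluation on the test set.
test_accuracy = search.score(X_test, Y_test)
print("test accuracy: %.3f" % test_accuracy)
```

For the MNIST setup above, the same pattern would apply to the two-RBM pipeline with `rbm2__...` entries added to the grid, at a correspondingly much higher computational cost.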