Reconstructing Numbers

I’m working on finishing up the code for the final 30 Days of Python project, saving the whales, but I took a detour to work with the MNIST handwritten digits again. I did this because they are great dataset for testing because it’s obvious what you are training on and its also well known how well an algorithm should do.

I’ve been implementing python code that trains restricted boltzmann machines (RBMs), deep belief nets (DBNs), and deep neural networks (DNNs). I originally implemented this in Matlab (as part of a coursera course) and then ported it to R (because it’s free, open source, and good for stats) and am now bringing it into Python. I chose to bring this code into python for a couple of reasons: I’m using python as my go to language, I want learn how to use numpy effectively, and I’ve become a fan of how python manages code. There are plenty of other reasons but I will caveat all of this with the fact that I know there are available RBM/DBN/DNN implementations available but I want to make my own as a learning exercise and because I like having control of the finer details of the model.

A Brief Intro to RBMs, DBNs, and DNNs

A Restricted Boltzmann Machine (RBM) is a model that has two layers: an input layer (called the visible units) and a hidden layer. Each layer is made up of “units” which can take on either a 1 or a 0. For a given input, the hidden layer can be computed using a weight matrix, a bias term, and the logistic function: h = logistic(W*v+b_fwd). Because the inputs are often real valued and the computation ends up with real values on the interval from 0.0 to 1.0 the values are treated as probabilities that a given unit will be ‘on’ or ‘off.’

The interesting thing about RBMs is they are trained to reconstruct their inputs. So instead of trying to tie the output layer to a digit label, we try to reconstruct the digit image. The theory being that if the model can reconstruct the input then the features learned will be useful in determining the label later. To calculate a reconstruction of the visible layer use the transpose of the weight matrix, a reverse bias term, and the logistic function: v = logistic(t(W)*h + b_rev). Calculating the hidden units from the visible is a forward pass and calculating the visible from the hidden is a reverse pass. The optimal way to train these models is quite slow but Geoffery Hinton found a shortcut called contrastive divergence:

  1. Compute a forward pass (v0 -> h0)
  2. Compute a reverse pass (h0 -> v1)
  3. Compute another forward pass (v1 -> h1)
  4. Calculate a weight update rule: delta_W = h0*t(v0) - h1*t(v1)

The full version of contrastive divergence requires the back and forth to go onto infinity but it turns out even one pass gives a good weight update rule that can be used in gradient descent.

RBMs can be stacked by using the output of one layer as the input to the next. Stacked RBMs are mathematically equivalent to Deep Belief Nets (DBNs) and it’s much faster to train them by doing “greedy” training one layer at a time. There are ways to train the DBN to improve the generative ability of the model and tie in the labels but I’m not interested in them for that.

The reason I use RBMs is you can initialize a deep neural network (DNN) with the same coefficients as the DBN. The two changes to make are discarding the reverse biases since the DNN is one way and adding an extra layer to connect to the labels. I follow what I learned in Hinton’s coursera class and use a softmax layer. A softmax layer is like a logistic layer except that it is normalized so that you can interpret it as a probability distribution. For the 10 MNIST digits there will always be 10 units in the softmax layer. For a digit of 2, unit 2 will have a high value (near 1.0) and the others will be near 0.0, which can be interpreted as high probability of it being a digit 2. For any input the sum of all the output units will be 1.0. The computation of a forward pass in the DNN is just a succession of forward passes of each layer. The model can be updated through gradient descent using the backprop method.

Crucially because the DNN is initialized with pre-trained RBMs only a few passes of backprop are needed to fine tune the model for discrimination. Virtually all of the training time is spent in the pre-training without the labels minimizing the amount of time the model has to overfit the training dataset.

Visualizing the Learning: Reconstructed Digits

One way to see how well the model learned from the data is to reconstruct the input data through the model. What I mean is set the input data to be an image of a digit, compute the hidden layer through a forward pass of the model, and then compute a reverse pass to get a reconstructed image. For a random model this results in white noise (think snow on the TV). Below is the result for a forward pass to the first layer and into the second layer followed by reverse pass by both. The reconstructed image is quite good meaning the model has the ability to represent the data well. Having worked with RBMs on multiple projects, I highly recommend this technique to verify that the model has learned something. It very easily points bugs in the code (my first reconstructions looked identical) or points to issues with trying to reconstruct the data this way.

Reconstruction of a digit through the RBMs

Reconstruction of digits through the RBMs

To explore what the model actually learned we can reconstruct an image from just the bias unit (i.e. turn off all of the hidden units). The reconstructed image is similar to the average of all of the images the model sees:

Digit Reconstruction from just the Bias compared to the Average Digit

Digit reconstruction from just the bias compared to the mean digit

Another way to see what’s going on is to turn on a single hidden unit in a layer and see what it constructs. The resulting image is essentially what would drive this hidden unit as close to 1 as possible. It shows what the model is paying attention too:

Reconstruction from a single hidden unit in the first layer

Reconstruction from a single hidden unit in the first layer

A dark area means it’s a higher value so you can see that these units tend have a very active curve somewhere in the image and very inactive spot next to it. What that means is each unit is triggering off of part of a digit. Notice that there aren’t really dark areas around the edges because the digits are centered in the image. Here’s a reconstruction from a single unit in the second layer:

Reconstruction from a single hidden unit in the seond layer

Reconstruction from a single hidden unit in the second layer

It gets harder to see what’s going on at this point but the main thing that is visible is that some of these have more than one area active and others have very tight localized spots that are active. Getting these plots for the first time wasn’t exactly straight forward so I thought I’d share my code:

# %% Probe the bias units
probe1 = dbn.rbm_rev(model_rbm1, np.matrix(np.zeros((N_hid1, 1))))
probe2 = dbn.rbm_rev(model_rbm1, dbn.rbm_rev(model_rbm2, np.matrix(np.zeros((N_hid2, 1)))))

plt.subplot(1, 3, 1)
plt.imshow(np.reshape(np.mean(train['data'],axis=1),image_shape),, interpolation='nearest')
plt.title('Mean Image')
plt.subplot(1, 3, 2)

plt.imshow(np.reshape(probe1,image_shape),, interpolation='nearest')
plt.title('Rev Bias 1')
plt.subplot(1, 3, 3)
plt.imshow(np.reshape(probe2,image_shape),, interpolation='nearest')
plt.title('Rev Bias 2')

# %% Probe the first RBM
bias1_probe = dbn.rbm_rev(model_rbm1, np.matrix(np.zeros((N_hid1, N_hid1))))
W1_probe = dbn.rbm_rev(model_rbm1, np.matrix(np.eye(N_hid1)))
rbm1_probe = W1_probe - bias1_probe

plt_shape1 = (6,12)
for i in range(
    plt.subplot(plt_shape1[0], plt_shape1[1], i+1)
    plt.imshow(np.reshape(rbm1_probe[:,i],image_shape),, interpolation='nearest')

# %% Probe the second RBM
bias2_probe = dbn.rbm_rev(model_rbm1, dbn.rbm_rev(model_rbm2, np.matrix(np.zeros((N_hid2, N_hid2)))))
W2_probe = dbn.rbm_rev(model_rbm1, dbn.rbm_rev(model_rbm2, np.matrix(np.eye(N_hid2))))
rbm2_probe = W2_probe - bias2_probe

plt_shape2 = (6,12)
for i in range(
    plt.subplot(plt_shape2[0], plt_shape2[1], i+1)
    plt.imshow(np.reshape(rbm2_probe[:,i],image_shape),, interpolation='nearest')


I trained each RBM layer with 30 passes through the data, 10 passes for the softmax layer, and then finished up with 10 passes of backpropagation for the whole DNN. The resulting model got me my highest score on the MNIST digit challenge on Kaggle, 97% accurate, which I believe is on par with humans. At the very least it beats my 96% that I got with K Nearest Neighbors.

Stay tuned for a comparison of the Python implementation to the R implementation with a bit about optimization.

p.s. Once my code structure settles down, I plan to update this with a link to github repo for the full thing.


30 Days of Python: Day 20 MNIST Digit Recognition

I’m making a small project every day in python for the next 30 days (minus some vacation days). I’m hoping to learn many new packages and  make a wide variety of projects, including games, computer tools, machine learning, and maybe some science. It should be a good variety and I think it will be a lot of fun.

Day 20: MNIST Digit Recognition

I took a crack at the digit recognition task on Kaggle today. I’m using Python’s scikit-learn package to learn a model for recognizing which hand written digit is present. I included in my code a preview of what the images look like:

Sample MNIST Digits

Sample MNIST Digits


The data features for each image are the 28×28 pixels unwrapped into 784 element array. I tried three different models: Logistic Regression, SVM, and KNN. Something went wrong with the SVM model (I’m guessing it couldn’t handle the integer values I originally gave it), so I didn’t end up using it. Because the other two were working and taking a while to train I didn’t spend much time on debugging it. I got fairly good results initially: LR had 88% precision and KNN had 96% precision.

A feature I used the metrics module’s classification_report to estimate how well my model did before I submitted it. I split the training data into training and validation data, predicted on the validation data without any extra training on it, and the used the classification report to tell me how the model did. For the classification report, the LR came out with 89% and the KNN with 96% which is very close to the actual scores I achieved. Another feature of the metrics module is the confusion_matrix which can show you where the classifications go awry. Here’s the output from the KNN classification report and confusion matrix:

Classification report for classifier KNeighborsClassifier(algorithm=auto, leaf_size=30,
 metric=minkowski, n_neighbors=5, p=2, weights=uniform):

  precision recall f1-score support

0      0.97   0.99     0.98    2013
1      0.94   1.00     0.97    2349
2      0.98   0.95     0.96    2042
3      0.95   0.96     0.96    2199
4      0.97   0.96     0.96    1999
5      0.96   0.95     0.96    1877
6      0.97   0.98     0.98    2122
7      0.95   0.97     0.96    2261
8      0.98   0.91     0.95    2019
9      0.94   0.94     0.94    2119

avg / total 0.96 0.96 0.96 21000

Confusion matrix:
[[2000    2    1    0    0    3    5    1    0    1]
 [   0 2338    4    0    1    1    0    4    0    1]
 [  17   29 1936    5    2    3    4   38    5    3]
 [   3    8   13 2121    0   22    1   10   11   10]
 [   2   23    0    0 1918    0    8    4    0   44]
 [  12    7    0   30    3 1778   27    2    6   12]
 [  16    8    1    0    3    7 2086    0    1    0]
 [   0   30    7    1    4    0    0 2189    0   30]
 [   7   23    6   53   16   29   12    8 1846   19]
 [   7   10    3   22   32    3    0   38    7 1997]]

Another feature of scikit-learn that I decided to check out was the preprocessing module, namely the StandardScaler which can learn the mean and variance of the training data and then can be used to center and scale the data to have a mean of 0 and variance of 1. This is useful to avoid fitting to spurious effects in the training data (say all of the ones just happened to have a particular lighting effect). The StandardScaler is as easy to use as the classifiers:

scaler = preprocessing.StandardScaler()
processed_data = scaler.transform(processed_data)
test_data = scaler.transform(np.array(line).astype(

The results for this were mixed. It improved the logistic regression fit to 89% but the KNN scored only a 93%. Again the classification report predicted the results correctly (LR 90%, KNN 93%). Even though the StandardScaler didn’t help much in this case, I’m glad I took the time to learn how to use it. I’m also really glad to find out about the metrics module. I’ve had to make that sort of function before and it’s tricky to get it robust and useful. I’m really glad to see that it’s just included (a great example of the batteries included philosophy of Python). The classification report is a must for Kaggle competitions given the limited number of submissions you get per day. Knowing your performance before you submit is definitely an edge. I’d like to try out either Neural Nets, Restricted Boltzmann Machines, or something else from that family of algorithms on the dataset but that will have to wait for another project.

This will be my last post for a few days as I will be on vacation. But I will be back with the rest of my 30 days!