30 Days of Python: Day 20 MNIST Digit Recognition

I’m making a small project every day in Python for the next 30 days (minus some vacation days). I’m hoping to learn many new packages and make a wide variety of projects, including games, computer tools, machine learning, and maybe some science. It should be a good variety and I think it will be a lot of fun.

Day 20: MNIST Digit Recognition

I took a crack at the digit recognition task on Kaggle today. I’m using Python’s scikit-learn package to learn a model for recognizing which handwritten digit is present. I included in my code a preview of what the images look like:

Sample MNIST Digits


The data features for each image are the 28×28 pixels unwrapped into a 784-element array. I tried three different models: Logistic Regression, SVM, and KNN. Something went wrong with the SVM model (I’m guessing it couldn’t handle the integer values I originally gave it), so I didn’t end up using it. Because the other two were working, and were taking a while to train, I didn’t spend much time on debugging it. I got fairly good results initially: LR had 88% precision and KNN had 96% precision.
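The model comparison above can be sketched roughly like this. This is a minimal, illustrative version, not the original script: it uses scikit-learn’s built-in 8×8 digits dataset as a small stand-in for the 28×28 Kaggle MNIST data, and the variable names are my own.

```python
# Fit a Logistic Regression and a KNN classifier on flattened digit
# images, then score each on held-out data. The built-in digits
# dataset (64 pixels per image) stands in for Kaggle's 784-pixel MNIST.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_val, y_train, y_val = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("LR accuracy: ", lr.score(X_val, y_val))
print("KNN accuracy:", knn.score(X_val, y_val))
```

On the small digits dataset both models score well; the gap between LR and KNN is more pronounced on the full MNIST data.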

I used the metrics module’s classification_report to estimate how well my model did before I submitted it. I split the training data into training and validation data, predicted on the validation data without any extra training on it, and then used the classification report to tell me how the model did. For the classification report, the LR came out with 89% and the KNN with 96%, which is very close to the actual scores I achieved. Another feature of the metrics module is the confusion_matrix, which can show you where the classifications go awry. Here’s the output from the KNN classification report and confusion matrix:

Classification report for classifier KNeighborsClassifier(algorithm='auto', leaf_size=30,
 metric='minkowski', n_neighbors=5, p=2, weights='uniform'):

             precision    recall  f1-score   support

          0       0.97      0.99      0.98      2013
          1       0.94      1.00      0.97      2349
          2       0.98      0.95      0.96      2042
          3       0.95      0.96      0.96      2199
          4       0.97      0.96      0.96      1999
          5       0.96      0.95      0.96      1877
          6       0.97      0.98      0.98      2122
          7       0.95      0.97      0.96      2261
          8       0.98      0.91      0.95      2019
          9       0.94      0.94      0.94      2119

avg / total       0.96      0.96      0.96     21000

Confusion matrix:
[[2000    2    1    0    0    3    5    1    0    1]
 [   0 2338    4    0    1    1    0    4    0    1]
 [  17   29 1936    5    2    3    4   38    5    3]
 [   3    8   13 2121    0   22    1   10   11   10]
 [   2   23    0    0 1918    0    8    4    0   44]
 [  12    7    0   30    3 1778   27    2    6   12]
 [  16    8    1    0    3    7 2086    0    1    0]
 [   0   30    7    1    4    0    0 2189    0   30]
 [   7   23    6   53   16   29   12    8 1846   19]
 [   7   10    3   22   32    3    0   38    7 1997]]
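The validation workflow that produced output like the above can be sketched as follows. Again this is an illustrative reconstruction on the built-in digits dataset, not the original code: hold out part of the training data, predict on it, and let the metrics module do the scoring.

```python
# Hold out validation data, predict on it, then summarize per-class
# precision/recall and the confusion matrix with sklearn.metrics.
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_val, y_train, y_val = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
predicted = clf.predict(X_val)

print("Classification report for classifier %s:" % clf)
print(metrics.classification_report(y_val, predicted))
print("Confusion matrix:")
print(metrics.confusion_matrix(y_val, predicted))
```

In the confusion matrix, row i column j counts validation images of digit i that were predicted as digit j, so a good classifier puts nearly everything on the diagonal.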

Another feature of scikit-learn that I decided to check out was the preprocessing module, namely the StandardScaler which can learn the mean and variance of the training data and then can be used to center and scale the data to have a mean of 0 and variance of 1. This is useful to avoid fitting to spurious effects in the training data (say all of the ones just happened to have a particular lighting effect). The StandardScaler is as easy to use as the classifiers:

import numpy as np
from sklearn import preprocessing
...
scaler = preprocessing.StandardScaler()
scaler.fit(processed_data)
processed_data = scaler.transform(processed_data)
...
# cast the raw pixel strings to integers before applying the same scaling
test_data = scaler.transform(np.array(line).astype(int))

The results for this were mixed. It improved the logistic regression fit to 89%, but the KNN scored only 93%. Again the classification report predicted the results correctly (LR 90%, KNN 93%). Even though the StandardScaler didn’t help much in this case, I’m glad I took the time to learn how to use it. I’m also really glad to find out about the metrics module. I’ve had to write that sort of function before, and it’s tricky to get it robust and useful. I’m really glad to see that it’s just included (a great example of the batteries-included philosophy of Python). The classification report is a must for Kaggle competitions given the limited number of submissions you get per day. Knowing your performance before you submit is definitely an edge. I’d like to try out Neural Nets, Restricted Boltzmann Machines, or something else from that family of algorithms on the dataset, but that will have to wait for another project.
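One convenient pattern I didn’t use here, but which scikit-learn supports, is chaining the scaler and the classifier into a single Pipeline, so the scaling learned from the training data is automatically reapplied at predict time. A hedged sketch, again on the built-in digits dataset:

```python
# make_pipeline chains StandardScaler and LogisticRegression:
# fit() scales the training data then fits the classifier, and
# score()/predict() reapply the same learned scaling to new data.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_val, y_train, y_val = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print("Scaled LR accuracy:", pipe.score(X_val, y_val))
```

This avoids the easy mistake of fitting the scaler on the test data, since the pipeline only ever calls transform on anything passed to predict or score.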

This will be my last post for a few days as I will be on vacation. But I will be back with the rest of my 30 days!

 
