Optimization with IPython

I just discovered the easy optimization features of IPython and thought I'd share how I used them to speed up my deep learning code. For a full explanation of the deep learning code, see my previous posts about the projects I'm building it for. This is going to be a speed comparison of the original R code, my first implementation in Python, and finally the optimized version in Python.

Over the last few weeks I've been exploring the various features of the IPython shell. I like its tab completion, its easy help syntax (?), its magic commands (with %), and its ability to access the shell (with !). But I'm really excited about the latest set of features I found: the optimization tools. You can profile your code with just a single line and not have to worry about re-writing all of the code to do it. I've optimized code before (usually with timers), and I've used profilers before, like the one built into NetBeans for Java. I also know that profile and timeit are Python packages rather than IPython features, but the ease of use is really what does it for me. I don't want to invest a ton of time and effort only to find out that my code is already as fast as it's going to get or that I'm looking in the wrong place. I want an answer now, and I want to be able to play around with it to dig into what's going on. IPython provides that with %timeit and %prun. Here's how I used these commands to get my 30-hour Python run down to 6 minutes.

The Problem:

First off, the code I was optimizing trains a machine learning model using gradient descent. To do that I have two key functions (with many supporting ones): optimize and a gradient function (cd1 in this case). The optimize function performs gradient descent using the model and the gradient function: it takes a chunk of data, calls the gradient function, and then applies a weight update to the model, over and over again. Here's the code:


import numpy as np

def optimize(model, gradient_function, training_data, momentum=0.9, learning_rate=0.1, N_epochs=10, mini_batch_size=100):
    '''Trains a model with gradient descent'''
    model = {key:np.copy(val) for key, val in model.iteritems()}
    momentum_speed = {key:np.zeros(np.shape(val)) for key, val in model.iteritems()}
    data_len = np.shape(training_data['data'])[1]
    for epoch in range(N_epochs):
        for mini_batch in extract_mini_batch(training_data, mini_batch_size):
            gradient = gradient_function(model, mini_batch)
            # momentum-smoothed update of each parameter matrix
            for key in model.iterkeys():
                momentum_speed[key] = momentum*momentum_speed[key] + gradient[key]
                model[key] += momentum_speed[key]*learning_rate  #-weightcost*model[key]
    return model

def cd1(model, data):
    '''Computes one iteration of contrastive divergence on the rbm model'''
    N_cases = np.shape(data['data'])[1]
    
    vis_prob_0 = data['data']
    vis_states_0 = sample_bernoulli(vis_prob_0)
    
    hid_prob_0 = logistic(model['W']*vis_states_0 + model['fwd_bias'])
    hid_states_0 = sample_bernoulli(hid_prob_0)
    
    vis_prob_n = logistic(model['W'].T*hid_states_0 + model['rev_bias'])
    vis_states_n = sample_bernoulli(vis_prob_n)
    
    hid_prob_n = logistic(model['W']*vis_states_n + model['fwd_bias'])
    
    vh0 = hid_states_0 * vis_states_0.T / N_cases
    vh1 = hid_prob_n * vis_states_n.T / N_cases
    cd1_value = vh0 - vh1
    
    model_delta = dict([('W', cd1_value),
                        ('fwd_bias',np.mean(hid_states_0 - hid_prob_n, axis=1)),
                        ('rev_bias',np.mean(vis_prob_0 - vis_prob_n, axis=1))])
    return model_delta

When I first ran this code to train a model on the full MNIST data for the Kaggle competition, it took 30 hours to complete the training on the 42,000 data samples. I could have just accepted that, but I had a suspicion that it could be faster because I was pretty sure the R version ran faster than that. So I set up a fair comparison and implemented the exact same problem in R with the old code. It ran in 30 minutes.

Before I found the IPython tools, I recorded a bunch of data to see how fast the code was running with different amounts of data. I found that the R code scaled linearly while the Python code scaled exponentially. I also found that if, instead of splitting the data up into mini-batches, I kept it all as one batch, the Python was linear and faster than the R. All of this led me to scour the internet for advice on optimizing Python code, specifically numpy code. I found some good tips from the scikit-learn site, the scipy lectures, and a good example of the results of optimization.

Profiling in IPython:

First I timed my function on a small subset of the data (700 samples, with 100 samples per mini-batch). I kept this result for comparison later, to see whether the changes really helped. To do this I just put %timeit in front of my call to the optimize function.


%timeit model_rbm1 = dbn.optimize(model_rbm1_init, dbn.cd1, train, learning_rate=lr_pretrain, N_epochs=N_epochs_pretrain)
1 loops, best of 3: 18.6 s per loop

Next I ran the profiler, which computes how much time is spent in each function that gets called. For this I just added %prun in front of the command (-l 10 limits the output to the top 10 calls):


#2000 cases

%prun -l 10 model_rbm1 = dbn.optimize(model_rbm1_init, dbn.cd1, train, learning_rate=lr_pretrain, N_epochs=N_epochs_pretrain)
   106298 function calls in 150.431 seconds

   Ordered by: internal time
   List reduced from 38 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     6597   44.723    0.007   44.737    0.007  {numpy.core._dotblas.dot}
     1800   36.143    0.020   36.178    0.020  dbn.py:10(logistic)
     1800   28.197    0.016   28.197    0.016  {method 'rand' of 'mtrand.RandomState' objects}
      600   20.137    0.034  146.362    0.244  dbn.py:31(cd1)
     1800   11.819    0.007   44.837    0.025  dbn.py:26(sample_bernoulli)
     1800    4.798    0.003    4.808    0.003  {method 'astype' of 'numpy.ndarray' objects}
        1    3.174    3.174  150.431  150.431  dbn.py:69(optimize)
     1200    1.201    0.001    1.207    0.001  {method 'reduce' of 'numpy.ufunc' objects}
    26397    0.090    0.000    0.101    0.000  defmatrix.py:290(__array_finalize__)
     1200    0.040    0.000    1.267    0.001  _methods.py:49(_mean)

The output is ordered by tottime, which is the total time spent in a function not counting the functions that it calls. From this I can see that most time is spent in the numpy dot product routine, as expected, and the next most is in the logistic function that I wrote. So that's what I investigated next. To do this I had to separate it out from the code that called it and pass it some dummy data in place of the real arguments. I used numpy.random.rand for that.

#for loop to get the number of calls comparable to dbn.optimize
%prun for i in range(2000): dbn.logistic(np.dot(np.random.rand(300,784),np.random.rand(784,100)) + np.random.rand(300,1))
   10003 function calls in 10.278 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     6000    7.535    0.001    7.535    0.001  {method 'rand' of 'mtrand.RandomState' objects}
     2000    1.585    0.001    1.585    0.001  {numpy.core._dotblas.dot}
     2000    0.846    0.000    0.846    0.000  dbn.py:10(logistic)
        1    0.311    0.311   10.278   10.278  <string>:1(<module>)
        1    0.000    0.000    0.000    0.000  {range}
        1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}

No matter what I tried I couldn’t get logistic to be as slow as it was in the optimize routine (cumulative time of 0.9s vs. 36.2s) so I backed up and tried timing the function that calls it: cd1. I figured there must be something about how it is called that slows it down. Here’s that profile run:

#for loop to get the number of calls comparable to dbn.optimize
%prun for i in range(0, data_len, mini_batch_size)*30: v = dbn.cd1(model_rbm1, {'data':(train['data'][:,i:i+mini_batch_size])})
    81805 function calls in 3.944 seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      3000    1.327    0.000    1.330    0.000  {numpy.core._dotblas.dot}
       600    1.264    0.002    3.733    0.006  dbn.py:31(cd1)
      1800    0.462    0.000    0.470    0.000  dbn.py:10(logistic)
      1800    0.459    0.000    0.459    0.000  {method 'rand' of 'mtrand.RandomState' objects}
         1    0.204    0.204    3.944    3.944  <string>:1(<module>)
      1800    0.073    0.000    0.579    0.000  dbn.py:26(sample_bernoulli)
      1800    0.042    0.000    0.044    0.000  {method 'astype' of 'numpy.ndarray' objects}
      1200    0.032    0.000    0.034    0.000  {method 'reduce' of 'numpy.ufunc' objects}
     21600    0.023    0.000    0.027    0.000  defmatrix.py:290(__array_finalize__)
      1200    0.018    0.000    0.062    0.000  _methods.py:49(_mean)

Again this isn’t as slow as it was when it was part of the optimize call (cumulative time of 3.7s vs. 146s) but working up this path seemed promising so I looked at how cd1 was called in the optimize function:

#gradient_function = cd1 for this call to optimize
for mini_batch in extract_mini_batch(training_data, mini_batch_size):
    gradient = gradient_function(model, mini_batch)
    ...

There’s nothing especially interesting about the gradient function line, but the for loop that feeds it uses a generator function. I tried swapping that out for a plain slicing of the training data and voila:

%timeit model_rbm1 = dbn.optimize(model_rbm1_init, dbn.cd1, train, learning_rate=lr_pretrain, N_epochs=N_epochs_pretrain)
1 loops, best of 3: 3.18 s per loop
# compared to 18.6s per loop before

The generator function was the culprit, not because it consumes a lot of resources (they are actually very efficient), but because I implemented it poorly:

#before
def extract_mini_batch(data, mini_batch_size=100):
    '''Generates mini_batches from the data'''
    data_len = np.shape(data['data'])[1]
    return (dict([(key,
                   val[:,i:i+mini_batch_size] if len(np.shape(val))>2 and np.shape(val)[1]==data_len else val)
                   for key, val in data.iteritems()])
            for i in xrange(0, data_len, mini_batch_size))

#after
def extract_mini_batch(data, mini_batch_size=100):
    '''Generates mini_batches from the data'''
    data_len = np.shape(data['data'])[1]
    return (dict([(key, val[:,i:i+mini_batch_size])
                   for key, val in data.iteritems()])
            for i in xrange(0, data_len, mini_batch_size))

Looking at the before version, the culprit isn't the ternary expression itself (an if/else inside a comprehension is cheap); it's that the condition len(np.shape(val))>2 is never true for the 2-D data matrix, so each "mini-batch" dict silently carried the entire data set into cd1. With the conditional removed, the data really is sliced and each call only sees mini_batch_size columns. I had to give up the ability to store some extra variables with the data, but they were extraneous and the speedup is definitely worth it. Plus it is more readable after the change.

Results

Below are the results of the optimization. You can see the exponential curve of the original Python code and the improved, linearly scaling Python code. Also note that the R code scales worse than the new Python code (still linearly, but with a worse coefficient). A fully timed run of the R code takes 30 minutes, while the new Python code takes only 6 minutes for all 42,000 data samples.

Python vs R Deep Learning

While it may be possible to get further improvements out of Python and R, those improvements would be hard to find, at least for me. I'm satisfied with how fast it runs and am glad to know how easy it is to profile code in Python. I had been wondering for a while how Python and R compare in speed, and while there are resources out there, testing it myself shows me the level of effort it takes to get good, fast code. I think Python wins on that one: between the straightforward numpy matrix syntax and the handy profiling commands, it really only took me a couple of days to clean up the code and get it running fast. It took a lot more work to get the R code running as it is.

 


Reconstructing Numbers

I'm working on finishing up the code for the final 30 Days of Python project, saving the whales, but I took a detour to work with the MNIST handwritten digits again. I did this because they are a great dataset for testing: it's obvious what you are training on, and it's also well known how well an algorithm should do.

I've been implementing python code that trains restricted boltzmann machines (RBMs), deep belief nets (DBNs), and deep neural networks (DNNs). I originally implemented this in Matlab (as part of a coursera course) and then ported it to R (because it's free, open source, and good for stats) and am now bringing it into Python. I chose to bring this code into Python for a couple of reasons: I'm using Python as my go-to language, I want to learn how to use numpy effectively, and I've become a fan of how Python manages code. There are plenty of other reasons, but I will caveat all of this with the fact that I know there are RBM/DBN/DNN implementations already available; I want to make my own as a learning exercise and because I like having control of the finer details of the model.

A Brief Intro to RBMs, DBNs, and DNNs

A Restricted Boltzmann Machine (RBM) is a model that has two layers: an input layer (called the visible units) and a hidden layer. Each layer is made up of "units" which can take on either a 1 or a 0. For a given input, the hidden layer can be computed using a weight matrix, a bias term, and the logistic function: h = logistic(W*v + b_fwd). Because the inputs are often real valued and the computation ends up with real values on the interval from 0.0 to 1.0, the values are treated as probabilities that a given unit will be 'on' or 'off.'

The interesting thing about RBMs is they are trained to reconstruct their inputs. So instead of trying to tie the output layer to a digit label, we try to reconstruct the digit image. The theory is that if the model can reconstruct the input, then the features learned will be useful in determining the label later. To calculate a reconstruction of the visible layer, use the transpose of the weight matrix, a reverse bias term, and the logistic function: v = logistic(t(W)*h + b_rev). Calculating the hidden units from the visible is a forward pass, and calculating the visible from the hidden is a reverse pass. The optimal way to train these models is quite slow, but Geoffrey Hinton found a shortcut called contrastive divergence:

  1. Compute a forward pass (v0 -> h0)
  2. Compute a reverse pass (h0 -> v1)
  3. Compute another forward pass (v1 -> h1)
  4. Calculate a weight update rule: delta_W = h0*t(v0) - h1*t(v1)

The full version of contrastive divergence requires the back and forth to continue to infinity, but it turns out even one pass gives a good weight update rule that can be used in gradient descent.
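For reference, here is a minimal sketch of what the two passes look like as numpy code. The helper names rbm_fwd and rbm_rev are my own (rbm_rev matches the dbn.rbm_rev calls in the plotting code further down), and the model is assumed to be a dict of numpy matrices under the keys 'W', 'fwd_bias', and 'rev_bias', just like in cd1:

import numpy as np

def logistic(x):
    '''Elementwise logistic function'''
    return 1./(1 + np.exp(-x))

def rbm_fwd(model, vis):
    '''Forward pass: hidden probabilities from visible units, h = logistic(W*v + b_fwd)'''
    return logistic(model['W']*vis + model['fwd_bias'])

def rbm_rev(model, hid):
    '''Reverse pass: reconstructed visible probabilities, v = logistic(W.T*h + b_rev)'''
    return logistic(model['W'].T*hid + model['rev_bias'])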

RBMs can be stacked by using the output of one layer as the input to the next. Stacked RBMs are mathematically equivalent to Deep Belief Nets (DBNs), and it's much faster to train them by doing "greedy" training one layer at a time. There are ways to train the DBN to improve the generative ability of the model and tie in the labels, but that's not what I'm interested in here.

The reason I use RBMs is you can initialize a deep neural network (DNN) with the same coefficients as the DBN. The two changes to make are discarding the reverse biases since the DNN is one way and adding an extra layer to connect to the labels. I follow what I learned in Hinton’s coursera class and use a softmax layer. A softmax layer is like a logistic layer except that it is normalized so that you can interpret it as a probability distribution. For the 10 MNIST digits there will always be 10 units in the softmax layer. For a digit of 2, unit 2 will have a high value (near 1.0) and the others will be near 0.0, which can be interpreted as high probability of it being a digit 2. For any input the sum of all the output units will be 1.0. The computation of a forward pass in the DNN is just a succession of forward passes of each layer. The model can be updated through gradient descent using the backprop method.
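As a small illustration (this is not code from my dbn module), a numpy softmax with one case per column, matching the layout used in the rest of the code, could look like this:

import numpy as np

def softmax(z):
    '''Turns each column of scores into a probability distribution (one case per column).
    Subtracting the column max first keeps the exponentials from overflowing.'''
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max(axis=0))
    return exp_z / exp_z.sum(axis=0)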

Crucially, because the DNN is initialized with pre-trained RBMs, only a few passes of backprop are needed to fine-tune the model for discrimination. Virtually all of the training time is spent in the pre-training, which doesn't use the labels, minimizing the amount of time the model has to overfit the training dataset.

Visualizing the Learning: Reconstructed Digits

One way to see how well the model learned from the data is to reconstruct the input data through the model. What I mean is: set the input data to be an image of a digit, compute the hidden layer through a forward pass of the model, and then compute a reverse pass to get a reconstructed image. For a random model this results in white noise (think snow on the TV). Below is the result of a forward pass into the first layer, and then into the second layer, followed by a reverse pass back through both. The reconstructed image is quite good, meaning the model has the ability to represent the data well. Having worked with RBMs on multiple projects, I highly recommend this technique to verify that the model has learned something. It very easily points out bugs in the code (my first reconstructions all looked identical) or issues with trying to reconstruct the data this way.

Reconstruction of digits through the RBMs

To explore what the model actually learned we can reconstruct an image from just the bias unit (i.e. turn off all of the hidden units). The reconstructed image is similar to the average of all of the images the model sees:

Digit reconstruction from just the bias compared to the mean digit

Another way to see what's going on is to turn on a single hidden unit in a layer and see what it reconstructs. The resulting image is essentially what would drive this hidden unit as close to 1 as possible. It shows what the model is paying attention to:

Reconstruction from a single hidden unit in the first layer

A dark area means a higher value, so you can see that these units tend to have a very active curve somewhere in the image and a very inactive spot next to it. What that means is each unit is triggering off of part of a digit. Notice that there aren't really dark areas around the edges because the digits are centered in the image. Here's a reconstruction from a single unit in the second layer:

Reconstruction from a single hidden unit in the second layer

It gets harder to see what's going on at this point, but the main thing that is visible is that some of these have more than one area active and others have very tight, localized spots that are active. Getting these plots for the first time wasn't exactly straightforward, so I thought I'd share my code:

# %% Probe the bias units
probe1 = dbn.rbm_rev(model_rbm1, np.matrix(np.zeros((N_hid1, 1))))
probe2 = dbn.rbm_rev(model_rbm1, dbn.rbm_rev(model_rbm2, np.matrix(np.zeros((N_hid2, 1)))))

plt.subplot(1, 3, 1)
plt.axis('off')
plt.imshow(np.reshape(np.mean(train['data'],axis=1),image_shape), cmap=plt.cm.gray_r, interpolation='nearest')
plt.title('Mean Image')
plt.subplot(1, 3, 2)
plt.axis('off')

plt.imshow(np.reshape(probe1,image_shape), cmap=plt.cm.gray_r, interpolation='nearest')
plt.title('Rev Bias 1')
plt.subplot(1, 3, 3)
plt.axis('off')
plt.imshow(np.reshape(probe2,image_shape), cmap=plt.cm.gray_r, interpolation='nearest')
plt.title('Rev Bias 2')

# %% Probe the first RBM
bias1_probe = dbn.rbm_rev(model_rbm1, np.matrix(np.zeros((N_hid1, N_hid1))))
W1_probe = dbn.rbm_rev(model_rbm1, np.matrix(np.eye(N_hid1)))
rbm1_probe = W1_probe - bias1_probe

plt_shape1 = (6,12)
plt.figure(figsize=(15.,9.))
for i in range(np.prod(plt_shape1)):
    plt.subplot(plt_shape1[0], plt_shape1[1], i+1)
    plt.axis('off')
    plt.imshow(np.reshape(rbm1_probe[:,i],image_shape), cmap=plt.cm.gray_r, interpolation='nearest')

# %% Probe the second RBM
bias2_probe = dbn.rbm_rev(model_rbm1, dbn.rbm_rev(model_rbm2, np.matrix(np.zeros((N_hid2, N_hid2)))))
W2_probe = dbn.rbm_rev(model_rbm1, dbn.rbm_rev(model_rbm2, np.matrix(np.eye(N_hid2))))
rbm2_probe = W2_probe - bias2_probe

plt_shape2 = (6,12)
plt.figure(figsize=(15.,9.))
for i in range(np.prod(plt_shape2)):
    plt.subplot(plt_shape2[0], plt_shape2[1], i+1)
    plt.axis('off')
    plt.imshow(np.reshape(rbm2_probe[:,i],image_shape), cmap=plt.cm.gray_r, interpolation='nearest')

Results

I trained each RBM layer with 30 passes through the data, 10 passes for the softmax layer, and then finished up with 10 passes of backpropagation for the whole DNN. The resulting model got me my highest score on the MNIST digit challenge on Kaggle, 97% accurate, which I believe is on par with humans. At the very least it beats my 96% that I got with K Nearest Neighbors.

Stay tuned for a comparison of the Python implementation to the R implementation with a bit about optimization.

p.s. Once my code structure settles down, I plan to update this with a link to the github repo for the full thing.

30 Days of Python: Day 30 Saving the Whales

I’ve made a small project every day in python for the past 30 days (minus some vacation days). I’ve learned many new packages and  made a wide variety of projects, including games, computer tools, machine learning, and some science. It was a lot of fun.

Day 30: Saving the Whales

For my last project, I thought I would revisit the Whale Detection Competition on Kaggle that I competed in last year. The goal was to train a model that could detect the presence of a whale call in an audio file. The data was provided by Cornell and consisted of 2-second sound recordings from buoys out in Massachusetts Bay, containing either background ocean noise or a Right whale's "up call," which starts low and rises. The Right whale is endangered (only 400 left) and doesn't call out very often, so it can be harder to detect than, say, a Humpback whale; better detection algorithms will help save the whales from being hit by shipping traffic.

I did pretty well last year in the competition, scoring a 0.95 area under the curve (AUC), where perfect would be 1.0. I utilized the deep learning models that I learned in Geoffrey Hinton's Coursera course on Neural Networks to build the models, but I did all of the work in R, which brings me to the present project.

My main tool over the years for data analysis has been Matlab at school and then at work, but last year, I learned R as an open source alternative. I took the deep belief net code that I learned from the Hinton course and retooled it from Matlab to R. I added evaluation features and hyper-parameters  for controlling various learning rates and just in general kept developing that code to work on further projects.

But the R code was clunky and messy. It got harder and harder to add new features, each building on previous functions. Additionally, a lot of the algorithms I wanted to try out and learn were in Matlab or Python (for instance the winning solution to the whale detection challenge). This was one of the big motivations behind learning Python. So in order to fully transition from R to Python, I thought I would take the time to rework the whale detection code into Python and learn about the various data tools in the process.

Fair warning: I didn't finish the deep learning portion of the project, but I walk through what I did complete and show how a simple model fares with the feature set that I constructed.

Pre-processing and Feature Extraction:

The data is a zipped-up folder of .aiff files, so the first thing that's necessary is to build a program to read in the files and extract whatever features are needed for the model. In R there was no direct way to read in a .aiff file, so I had to run the sox tool in a .bat file to convert the files into .wav files. To my delight, not only is there an easy way to read .aiff files in Python, but it is part of the standard modules – batteries included, so to speak.

With the file ingested, I converted it to a numpy array and then used matplotlib to plot a spectrogram of the audio file. A spectrogram is a way of examining how the spectrum of a time signal evolves over time. Specifically, it takes short chunks of time, computes the FFT of each, and then plots these snapshots of the spectrum on the y axis versus time on the x axis (the amplitude of the spectrum is the intensity in the image).

Here’s how to do that in code:

import os
import aifc
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(18.,12.))
for i, file_name in enumerate(file_names[j*N_plot:(j+1)*N_plot]):
    f = aifc.open(os.path.join(data_loc,train_folder,file_name), 'r')
    str_frames = f.readframes(f.getnframes())
    Fs = f.getframerate()
    time_data = np.fromstring(str_frames, np.short).byteswap()
    f.close()

    # spectrogram of file
    plt.subplot(N_plot/4, 4, i+1)
    Pxx, freqs, bins, im = plt.specgram(time_data,Fs=Fs,noverlap=90,cmap=plt.cm.gist_heat)
    plt.title(file_name+' '+file_name_to_labels[file_name])

Instead of just looking at one file, which might not be a great example and would only show either a whale call or not a whale call, I used matplotlib to tile multiple images to get a better sense of the data. This was way easier to do than in R, and being able to easily control the figure size was easier than in Matlab.

Here’s what some of those look like:

Right Whale Call Spectrograms (calls are labelled 1)

To make this usable as inputs to the data model, I needed the raw data that went into the image matplotlib created, which is readily available in the values returned by the specgram function. As is, this would yield almost 3000 features per audio file, which is too much for my computer to handle (there are 30,000 audio files). So I used the frequency vector and the bin vector (time) to eliminate the lowest and highest frequencies as well as the beginning and end of each clip. The result was reduced to 600 features per clip, which is more manageable.
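Here's a sketch of that reduction using the freqs and bins vectors that specgram returns (the cutoff values below are illustrative, not the exact ones I used):

import numpy as np

def reduce_spectrogram(Pxx, freqs, bins, f_lo=40., f_hi=400., t_lo=0.2, t_hi=1.8):
    '''Keeps a band of frequencies and the middle of the clip, then flattens the
    spectrogram into one feature vector (the cutoffs here are just placeholders)'''
    freq_mask = (freqs > f_lo) & (freqs < f_hi)
    time_mask = (bins > t_lo) & (bins < t_hi)
    return np.asarray(Pxx)[freq_mask, :][:, time_mask].ravel()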

I turned the plotting routine into a function, wrapped that in a list comprehension which looped over each file in the list of files, and finally constructed a numpy array out of the resulting list. I used cPickle to save this to disk so I wouldn't need to repeat it. This portion of the project took me a while to do since I had never done any of these operations before and my original whales project was quite a while ago.
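In outline, that pass looked something like the following; extract_features stands in for the specgram routine above refactored into a function, the pickle file name is a placeholder, and data_loc, train_folder, and file_names are the same names used in the earlier snippet:

import os
import cPickle
import numpy as np

# One flattened feature vector per .aiff file, stacked into a single numpy array
features = np.array([extract_features(os.path.join(data_loc, train_folder, name))
                     for name in file_names])

# Save the array so the slow extraction step doesn't have to be repeated
with open('train_features.pkl', 'wb') as f_out:
    cPickle.dump(features, f_out, protocol=2)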

Building Restricted Boltzmann Machines and Deep Belief Nets

Unfortunately, I ran out of time and was unable to complete the conversion of the stacked RBM code. I did however complete the optimize function that could perform the model updates and I was able to verify that the RBM executed properly (although I couldn’t test its efficacy).

My deep learning model is a Neural Network constructed from a Deep Belief Net, which in turn is made of stacked Restricted Boltzmann Machines. Restricted Boltzmann Machines (RBMs) are like one layer of a neural network, but they are trained in a special way, Contrastive Divergence, that doesn't require the data labels. This unsupervised learning algorithm seeks to improve the ability of the RBM to represent the data by training it to reconstruct the input data from the hidden layer of the network. For a better explanation of why this works, I recommend Hinton's homepage, which is full of his papers and lectures.

Here’s the Python version of the Contrastive Divergence algorithm:


import numpy as np

def logistic(x):
    '''Computes the logistic'''
    return 1./(1 + np.exp(-x))

def sample_bernoulli(probabilities):
    '''Samples from a bernoulli distribution for each element of the matrix'''
    return np.greater(probabilities, np.random.rand(*np.shape(probabilities))).astype(np.float)

def cd1(model, visible_data):
    '''Computes one iteration of contrastive divergence on the rbm model'''
    N_cases = np.shape(visible_data)[1]

    #forward propagation of the inputs
    vis_prob_0 = visible_data
    vis_states_0 = sample_bernoulli(vis_prob_0)

    hid_prob_0 = logistic(model['W']*vis_states_0 + model['fwd_bias'])
    hid_states_0 = sample_bernoulli(hid_prob_0)

    #reverse propagation to reconstruct the inputs
    vis_prob_n = logistic(model['W'].T*hid_states_0 + model['rev_bias'])
    vis_states_n = sample_bernoulli(vis_prob_n)

    hid_prob_n = logistic(model['W']*vis_states_n + model['fwd_bias'])

    #compute how good the reconstruction was
    vh0 = hid_states_0 * vis_states_0.T / N_cases
    vh1 = hid_prob_n * vis_states_n.T / N_cases

    cd1_value = vh0 - vh1

    model_delta = dict([('W', cd1_value),
                        ('fwd_bias',np.mean(hid_states_0 - hid_prob_n, axis=1)),
                        ('rev_bias',np.mean(visible_data - vis_prob_n, axis=1))])
    return model_delta

A Deep Belief Network (DBN) is a stacked-up version of pre-trained RBM models, which can then be treated as a Neural Network and fine-tuned by the standard backpropagation algorithm using the data labels. Because the model is pre-trained on the data, the backpropagation step doesn't have to change as much to get a good model, and because the pre-training didn't use the labels it is less likely to overfit.
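To make the stacking concrete, here is a rough sketch of greedy layer-wise pre-training built on the cd1 function above. My real optimize function handles momentum and mini-batches; init_rbm, pretrain_rbm, the layer sizes, and train_data (the feature matrix as a numpy matrix with one clip per column) are all illustrative assumptions:

import numpy as np

def init_rbm(N_vis, N_hid):
    '''Small random weights and zero biases for one RBM layer (an assumed helper)'''
    return {'W': np.matrix(0.01*np.random.randn(N_hid, N_vis)),
            'fwd_bias': np.matrix(np.zeros((N_hid, 1))),
            'rev_bias': np.matrix(np.zeros((N_vis, 1)))}

def pretrain_rbm(visible_data, N_hid, learning_rate=0.1, N_epochs=30):
    '''Bare-bones gradient ascent on the cd1 update (no momentum or mini-batches),
    just to show the idea of training one layer at a time'''
    model = init_rbm(np.shape(visible_data)[0], N_hid)
    for epoch in range(N_epochs):
        gradient = cd1(model, visible_data)
        for key in model:
            model[key] = model[key] + learning_rate*gradient[key]
    return model

# Greedy stacking: the second RBM trains on the first layer's hidden probabilities
model_rbm1 = pretrain_rbm(train_data, 300)
hid1 = logistic(model_rbm1['W']*train_data + model_rbm1['fwd_bias'])
model_rbm2 = pretrain_rbm(hid1, 100)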

K Nearest Neighbors

To achieve some closure in this project, I ran the feature set that I built through the K nearest neighbors algorithm in scikit-learn. The results weren't great, but they were about the same as the Cornell benchmark for the competition. I really like how easy it is to get all of the reporting tools in Python:

Classification report for classifier KNeighborsClassifier(algorithm=auto, leaf_size=30,
metric=minkowski, n_neighbors=5, p=2, weights=uniform):
            precision recall f1-score support
          0      0.84   0.90     0.87   11286
          1      0.61   0.48     0.54    3714

avg / total      0.78   0.79     0.79   15000

Confusion matrix:
    [[10119 1167]
     [ 1923 1791]]
AUC:
0.689413478377
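For reference, a report like the one above takes only a few scikit-learn calls. This is a sketch with assumed variable names (features, labels) and an arbitrary half-and-half split, not my exact script:

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Assumed: features is an (N_files, N_features) array and labels is a 0/1 vector
N_train = len(labels) // 2
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(features[:N_train], labels[:N_train])

predicted = classifier.predict(features[N_train:])
scores = classifier.predict_proba(features[N_train:])[:, 1]

print("Classification report for classifier %s:\n%s"
      % (classifier, metrics.classification_report(labels[N_train:], predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(labels[N_train:], predicted))
print("AUC:\n%s" % metrics.roc_auc_score(labels[N_train:], scores))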

Conclusions

Overall I am very pleased with writing these algorithms in Python. I had to jump through many hoops to get the matrices to work right when I wrote it in R. For Python, the ability to use numpy algorithms over and over again in clear and simple ways was quite nice. I think I was slowed down more by reading my old R code than by writing the Python version, although testing each bit of code to make sure it was right did also take some time. I decided to end this project early because I knew I wouldn't be able to write good Python code if I rushed it any more than I already had, and I plan on using Python for a while so it was better to get it right. Rest assured I will follow up with the completion of the conversion to Python; after all, the Whale Detection Challenge inspired half of this blog's name.

Final thoughts for 30 Days of Python

These 30 days have been a great experience. I did find the process quite exhausting at times and I wasn’t always sure I would get through it. I came out the other side though with a lot more knowledge of how to do useful things in Python. I’ve already started applying this knowledge at work. I hope to continue this learning process at a slower pace and also take the time to dive into some deeper projects that I thought of while doing my 30 days. I want to thank everyone who left comments, liked a post (here or on google+), or even just read what I wrote. Knowing that people were paying attention really kept me to my schedule and I learned a lot of useful information from people’s feedback.

Thanks,
Robb

30 Days of Python: Day 29 Visualizing Particle Filters

I’m making a small project every day in python for the next 30 days (minus some vacation days). I’m hoping to learn many new packages and  make a wide variety of projects, including games, computer tools, machine learning, and maybe some science. It should be a good variety and I think it will be a lot of fun.

Day 29: Visualizing Particle Filters for Robot Localization

Back in the spring I took Udacity’s Artificial Intelligence for Robotics class. It was a great class that covered a wide range of topics all centered around the algorithms that go into an autonomous car (the instructor, Sebastian Thrun, is the guy behind Google’s and Stanford’s self-driving car). I really enjoyed that the class had python based homework assignments. One assignment was to program a particle filter to perform localization for a robot. A particle filter uses several hundred “particles” to model a robot’s motion and then the average position acts as an estimate of the robot’s position, at least, if it doesn’t diverge. Because the course had us programming in a web interface, we couldn’t do much visualization of what was going wrong. Now that I’ve learned more about python, I thought I’d give it a crack offline.

So for today's project, I used the Python turtle package to draw the world, the robot, and the particles. You can actually watch the particles converge as they go through each motion the robot takes. Below is a GIF of the robot and particles in action. The robot moves through a sequence of moves, shown in blue. We have no idea where the robot is, so we initialize the particles all over the world (in red) with all possible orientations. We do, however, know what move the robot made and how far the robot was from the four landmarks (black circles). For each step of the particle filter we move all the particles according to how the robot moved, estimate the probability of each particle being correct by comparing its measurements to the robot's, and then we resample the particle distribution. That resampling piece is key. The particles that are most likely to be right get resampled the most and unlikely ones don't get sampled at all. Because both the motions and measurements are noisy, the resampled particles diverge from each other. But because any that diverge too much from the actual robot die off, the clump of particles converges around the robot. In the picture below the particles alternate between a movement and the subsequent resampling.

Robot Localization by Particle Filter

After just one motion,  measurement, and resampling the particles have already closed in quite a bit on the robot. Overall turtle worked really well for visualizing this. I thought about static plots with matplotlib but the animation is definitely the most important part for this. I might do another project in the future centered around more complex particle filters because they are a pretty good way to localize on a map.
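The resampling step is the heart of the filter, so here is a minimal sketch of it from memory of the course (the "resampling wheel"); particles and weights are assumed to be parallel lists:

import random

def resample(particles, weights):
    '''Draws a new set of N particles with probability proportional to weight,
    stepping around the "resampling wheel" in random increments'''
    N = len(particles)
    new_particles = []
    index = random.randint(0, N - 1)
    beta = 0.0
    max_weight = max(weights)
    for _ in range(N):
        beta += random.uniform(0, 2.0*max_weight)
        while weights[index] < beta:
            beta -= weights[index]
            index = (index + 1) % N
        # in the real filter each chosen particle is copied (and re-noised on the next move)
        new_particles.append(particles[index])
    return new_particles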

 

30 Days of Python: Day 28 Interactive Unix Tools

I’m making a small project every day in python for the next 30 days (minus some vacation days). I’m hoping to learn many new packages and  make a wide variety of projects, including games, computer tools, machine learning, and maybe some science. It should be a good variety and I think it will be a lot of fun.

Day 28: Interactive Unix Tools

For today's project, I went back to my utilities projects and modified them to work well in the interactive Python shell. These are the tools that duplicated Unix functionality for Windows (cat, head, grep, wc, nl, and tail). The idea was to make using Windows more like using the Linux or Unix shell. The tools worked great from the Windows command line, but sometimes you'd prefer to be in the more powerful Python interactive shell. The tools as they were always dumped their output to standard out, so it wasn't possible to chain them together and create pipelines in the Python shell.

To remedy this I made each of the tools return a generator. In order to keep the Windows command line functionality, I changed the main command to call a wrapped version of the function that sends the generator to standard out. One other feature I added was the ability for each tool to work with either a file object or a file name. Here's an example of how this pattern was applied to nl, the line-numbering tool:

import sys

def nl_file_name(file_name):
    '''Generates numbered lines for a file'''
    with open(file_name) as f_in:
        return (line for line in list(nl_file_in(f_in)))

def nl_file_in(file_in):
    '''Generates numbered lines for an open file'''
    return (str(i)+'\t'+line for i, line in enumerate(file_in))

def nl(file1):
    '''Numbers the lines in the file'''
    if isinstance(file1,str):
        return nl_file_name(file1)
    else:
        return nl_file_in(file1)

def nl_dump(file1):
    '''Numbers the lines in the file'''
    sys.stdout.writelines(nl(file1))

Note that the version that opens the file has to force the generator into a list in order to deal with the file closing. To truly create a pipeline without worrying about memory issues, it would be best to use the tools with the files already open instead of by file name. Adding the interactive mode to the utilities leaves me pretty satisfied with my toolset.
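For example, in the interactive shell you can now do something like this (the file name is just a placeholder):

# Number the lines of a log and keep only the ones mentioning ERROR -- all lazily,
# since nl returns a generator when handed an already-open file
import sys
with open('build.log') as f_in:
    sys.stdout.writelines(line for line in nl(f_in) if 'ERROR' in line)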

 

For those that are interested, here’s my desert island utilities repository: https://github.com/robb07/utilities

30 Days of Python: Day 27 Traveling Electrons

I’m making a small project every day in python for the next 30 days (minus some vacation days). I’m hoping to learn many new packages and  make a wide variety of projects, including games, computer tools, machine learning, and maybe some science. It should be a good variety and I think it will be a lot of fun.

Day 27: Traveling Electrons

For today's project, I simulated an electron moving in electric and magnetic fields. I used the vpython package, which is a great tool for physical simulations because it is simple to do the vector calculations and to pair them with visible objects on the screen. Unlike the orrery project, where I used a clockwork model of the system, for this project I simulated the net force on the particle. The Lorentz force describes the force of electric and magnetic fields on a charged particle. Here's what it looks like in code:

while True:
    # Lorentz force: F = q(E + v x B)
    F = particle.charge*(E_field.mag*E_field.axis + particle.vel.cross(M_field.mag*M_field.axis))
    accel = F/particle.mass
    # simple Euler integration of the velocity and position
    particle.vel = particle.vel + accel*DT
    particle.pos = particle.pos + particle.vel*DT
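For context, the loop above relies on a few objects being set up first. Here is a minimal sketch of that setup; the specific values, and the trick of hanging charge, mass, vel, and mag attributes on the visual objects, are my assumptions rather than the original script:

from visual import *  # classic VPython ("visual") module

DT = 1e-11  # integration time step (illustrative value)

# Field directions as arrows; .mag is set as an extra attribute for the strength
E_field = arrow(axis=vector(0, 1, 0))
E_field.mag = 0.0          # start with no electric field
M_field = arrow(axis=vector(0, 0, 1))
M_field.mag = 1e-4

# The electron: a sphere with the physical quantities attached as extra attributes
particle = sphere(pos=vector(0, 0, 0), radius=0.05, make_trail=True)
particle.charge = -1.602e-19
particle.mass = 9.109e-31
particle.vel = vector(1e5, 0, 0)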

Because the magnetic force on the particle is perpendicular to both the particle's velocity and the magnetic field, it acts as a centripetal force, sending the particle in a circle:

Only Magnetic Field – Circle

If the particle has a small component of its velocity in the same direction as the magnetic field, the result is a helix along the direction of the magnetic field:

Only Magnetic Field – Helix

If an electric field is added on top of that the helix expands in the direction of the electric field:

Electric and Magnetic Field – Expanding Helix

If the electric field points to a side then it causes drift:

Electric and Magnetic Field – Drifting Helix

I initially had a bug in my code that made all of the shapes wrong and not match my intuition. I finally spotted it and the simulation started behaving correctly. That's a good lesson learned: really work out what a few solutions should look like before moving on to the more complex ones. I'd like to spend more time with this one and see if I can add in varying electromagnetic fields to see what I can get the electron to do!

For those that are interested, here’s my science simulations repository: https://github.com/robb07/science_sims

30 Days of Python: Day 26 Cryptoquip

I’m making a small project every day in python for the next 30 days (minus some vacation days). I’m hoping to learn many new packages and  make a wide variety of projects, including games, computer tools, machine learning, and maybe some science. It should be a good variety and I think it will be a lot of fun.

Day 26: Cryptoquip

For today's project, I made a cryptoquip game. This is my favorite newspaper puzzle, where the goal is to decode a witty saying that's been encrypted with a substitution cipher. Spaces and punctuation are left intact, so it's all about guessing which letters are which by looking for letters that might be by themselves (a, I) or double letters (often e, l, etc.). It's a fun puzzle to play. Here's a screenshot of one that's nearly solved:

Einstein Quote

The encryption method is very easy to implement. The key is a dictionary that maps each plaintext letter to a letter of a randomly shuffled ciphertext alphabet. The ciphertext (the encoded message) uses all caps, while the plaintext uses lower case. A simple list comprehension can do the substitution:

def encrypt_substitute(message, key):
    '''Encrypts the message with a substitution cipher'''
    return ''.join([key[m] if m in key else m for m in message])
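Generating the key is just a shuffle of the alphabet. Here's a minimal sketch of how that might look (make_key is my own illustrative helper, following the uppercase-ciphertext / lowercase-plaintext convention described above):

import random
import string

def make_key():
    '''Builds the encryption key: each lowercase plaintext letter is mapped to a
    letter of a randomly shuffled uppercase ciphertext alphabet'''
    ciphertext_alphabet = list(string.ascii_uppercase)
    random.shuffle(ciphertext_alphabet)
    return dict(zip(string.ascii_lowercase, ciphertext_alphabet))

key = make_key()
print encrypt_substitute('imagination is more important than knowledge', key)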

For the guessing portion, the player moves the cursor around with the arrow keys and can work either in the message or off to the side in the key. The working key is initialized with an underscore for each ciphertext letter. The same substitution method, applied with the reversed (working) key, does the decoding. I wanted to add the ability to click on the letter you want to change, but I ran out of time. Perhaps I'll come back and add that after the 30 days are up, which is getting quite close.

For those who are interested here’s my github repo for the games project: https://github.com/robb07/python_games