Optimization with IPython

I just discovered the easy optimization features of IPython and thought I’d share how I used them to speed up my deep learning code. For a full explanation of the deep learning code, see my previous posts about the projects I’m building it for. This post is a speed comparison of the original R code, my first implementation in Python, and finally the optimized Python version.

Over the last few weeks I’ve been exploring the various features of the IPython shell. I like its tab completion, its easy help syntax (?), its magic commands (with %), and its ability to access the shell (with !). But I’m really excited about the latest set of features I found: the optimization tools. You can profile your code with a single line and don’t have to rewrite any of it to do so. I’ve optimized code before (usually with timers), and I’ve used profilers before, like the one built into NetBeans for Java. I also know that profile and timeit are standard Python packages rather than IPython features, but the ease of use is really what does it for me. I don’t want to invest a ton of time and effort only to find out that my code is already as fast as it’s going to get, or that I’m looking in the wrong place. I want an answer now, and I want to be able to play around and dig into what’s going on. IPython provides that with %timeit and %prun. Here’s how I used these commands to get my 30-hour Python run down to 6 minutes.
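
For anyone who hasn’t used these features, here’s roughly what they look like in an IPython session (a toy example of the syntax, not output from my project):

In [1]: import numpy as np
In [2]: np.mean?          # quick help on any object or function
In [3]: !ls               # run a shell command without leaving IPython
In [4]: %timeit np.dot(np.random.rand(300, 784), np.random.rand(784, 100))
In [5]: %prun np.dot(np.random.rand(300, 784), np.random.rand(784, 100))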

The Problem:

First off, the code I was optimizing trains a machine learning model using gradient descent. To do that I have two key functions, with many supporting ones: optimize and a gradient function (cd1 in this case). The optimize function performs gradient descent using the model and the gradient function: it takes a chunk of data, calls the gradient function, and then applies a weight update to the model, over and over again. Here’s the code:


import numpy as np

def optimize(model, gradient_function, training_data, momentum=0.9, learning_rate=0.1, N_epochs=10, mini_batch_size=100):
    '''Trains a model with gradient descent'''
    model = {key:np.copy(val) for key, val in model.iteritems()}
    momentum_speed = {key:np.zeros(np.shape(val)) for key, val in model.iteritems()}
    data_len = np.shape(training_data['data'])[1]
    for epoch in range(N_epochs):
        for mini_batch in extract_mini_batch(training_data, mini_batch_size):
            gradient = gradient_function(model, mini_batch)
            for key in model.iterkeys():
                momentum_speed[key] = momentum*momentum_speed[key] + gradient[key]
                model[key] += momentum_speed[key]*learning_rate#-weightcost*model[key]
    return model

def cd1(model, data):
    '''Computes one iteration of contrastive divergence on the rbm model'''
    N_cases = np.shape(data['data'])[1]
    
    vis_prob_0 = data['data']
    vis_states_0 = sample_bernoulli(vis_prob_0)
    
    hid_prob_0 = logistic(model['W']*vis_states_0 + model['fwd_bias'])
    hid_states_0 = sample_bernoulli(hid_prob_0)
    
    vis_prob_n = logistic(model['W'].T*hid_states_0 + model['rev_bias'])
    vis_states_n = sample_bernoulli(vis_prob_n)
    
    hid_prob_n = logistic(model['W']*vis_states_n + model['fwd_bias'])
    
    vh0 = hid_states_0 * vis_states_0.T / N_cases
    vh1 = hid_prob_n * vis_states_n.T / N_cases
    cd1_value = vh0 - vh1
    
    model_delta = dict([('W', cd1_value),
                        ('fwd_bias',np.mean(hid_states_0 - hid_prob_n, axis=1)),
                        ('rev_bias',np.mean(vis_prob_0 - vis_prob_n, axis=1))])
    return model_delta
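
cd1 relies on two small helpers from dbn.py that aren’t shown above, and the model itself is just a dictionary of numpy matrices, so * means matrix multiplication. Here’s a rough sketch for context; it is a reconstruction rather than the exact dbn.py code, and the layer sizes are only illustrative (784 visible units for the MNIST pixels, 300 hidden units to match the dummy shapes used in the profiling runs below):

import numpy as np

def logistic(x):
    '''Elementwise sigmoid.'''
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(probabilities):
    '''Sample binary states: 1 where a uniform draw falls below the given probability.'''
    return (np.random.rand(*np.shape(probabilities)) < probabilities).astype(float)

# Illustrative model/data layout (shapes assumed, one column per training case):
model_rbm1_init = {'W': np.asmatrix(0.01*np.random.randn(300, 784)),
                   'fwd_bias': np.asmatrix(np.zeros((300, 1))),
                   'rev_bias': np.asmatrix(np.zeros((784, 1)))}
train = {'data': np.asmatrix(np.random.rand(784, 700))}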

When I first ran this code to train a model on the full MNIST data for the Kaggle competition, it took 30 hours to complete training on the 42,000 data samples. I could have just accepted that, but I suspected it could be faster because I was pretty sure the R version ran faster than that. So I set up a fair comparison and implemented the exact same problem in R with the old code. It ran in 30 minutes.

Before I found the IPython tools, I recorded a bunch of timings to see how fast the code ran with different amounts of data. I found that the R code scaled linearly while the Python code scaled much worse than linearly (the curve looked exponential). I also found that if I kept all of the data as one batch instead of splitting it into mini-batches, the Python code scaled linearly and ran faster than the R code. All of this led me to scour the internet for advice on optimizing Python code, specifically numpy code. I found some good tips on the scikit-learn site, in the scipy lectures, and in a good worked example of optimization results.

Profiling in IPython:

First I timed my function on a small subset of the data (700 samples, with 100 samples per mini-batch). I kept this number as a baseline to check later whether the changes really helped. To time it, I just put %timeit in front of my call to the optimize function.


%timeit model_rbm1 = dbn.optimize(model_rbm1_init, dbn.cd1, train, learning_rate=lr_pretrain, N_epochs=N_epochs_pretrain)
1 loops, best of 3: 18.6 s per loop

Next I ran the profiler, which measures how much time is spent in each function that gets called. For this I just added %prun in front of the command (-l 10 limits the output to the top 10 entries):


#2000 cases

%prun -l 10 model_rbm1 = dbn.optimize(model_rbm1_init, dbn.cd1, train, learning_rate=lr_pretrain, N_epochs=N_epochs_pretrain)
   106298 function calls in 150.431 seconds

   Ordered by: internal time
   List reduced from 38 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     6597   44.723    0.007   44.737    0.007  {numpy.core._dotblas.dot}
     1800   36.143    0.020   36.178    0.020  dbn.py:10(logistic)
     1800   28.197    0.016   28.197    0.016  {method 'rand' of 'mtrand.RandomState' objects}
      600   20.137    0.034  146.362    0.244  dbn.py:31(cd1)
     1800   11.819    0.007   44.837    0.025  dbn.py:26(sample_bernoulli)
     1800    4.798    0.003    4.808    0.003  {method 'astype' of 'numpy.ndarray' objects}
        1    3.174    3.174  150.431  150.431  dbn.py:69(optimize)
     1200    1.201    0.001    1.207    0.001  {method 'reduce' of 'numpy.ufunc' objects}
    26397    0.090    0.000    0.101    0.000  defmatrix.py:290(__array_finalize__)
     1200    0.040    0.000    1.267    0.001  _methods.py:49(_mean)

The output is ordered by tottime, the total time spent in that function itself, not counting the functions it calls. From this I can see that most of the time is spent in the numpy dot product routine, as expected, and the next most is in the logistic function I wrote. So that’s what I investigated next. To do this I had to separate it out from the code that called it and pass it some dummy data in place of the real arguments; I used numpy.random.rand for that.

#for loop to get the number of calls comparable to dbn.optimize
%prun for i in range(2000): dbn.logistic(np.dot(np.random.rand(300,784),np.random.rand(784,100)) + np.random.rand(300,1))
   10003 function calls in 10.278 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     6000    7.535    0.001    7.535    0.001  {method 'rand' of 'mtrand.RandomState' objects}
     2000    1.585    0.001    1.585    0.001  {numpy.core._dotblas.dot}
     2000    0.846    0.000    0.846    0.000  dbn.py:10(logistic)
        1    0.311    0.311   10.278   10.278  <string>:1(<module>)
        1    0.000    0.000    0.000    0.000  {range}
        1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}

No matter what I tried, I couldn’t get logistic to be as slow as it was inside the optimize run (cumulative time of 0.9 s vs. 36.2 s), so I backed up and profiled the function that calls it: cd1. I figured there must be something about how it is called that slows it down. Here’s that profile run:

#for loop to get the number of calls comparable to dbn.optimize
%prun for i in range(0, data_len, mini_batch_size)*30: v = dbn.cd1(model_rbm1, {'data':(train['data'][:,i:i+mini_batch_size])})
    81805 function calls in 3.944 seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      3000    1.327    0.000    1.330    0.000  {numpy.core._dotblas.dot}
       600    1.264    0.002    3.733    0.006  dbn.py:31(cd1)
      1800    0.462    0.000    0.470    0.000  dbn.py:10(logistic)
      1800    0.459    0.000    0.459    0.000  {method 'rand' of 'mtrand.RandomState' objects}
         1    0.204    0.204    3.944    3.944  <string>:1(<module>)
      1800    0.073    0.000    0.579    0.000  dbn.py:26(sample_bernoulli)
      1800    0.042    0.000    0.044    0.000  {method 'astype' of 'numpy.ndarray' objects}
      1200    0.032    0.000    0.034    0.000  {method 'reduce' of 'numpy.ufunc' objects}
     21600    0.023    0.000    0.027    0.000  defmatrix.py:290(__array_finalize__)
      1200    0.018    0.000    0.062    0.000  _methods.py:49(_mean)

Again, this isn’t nearly as slow as when it runs as part of the optimize call (cumulative time of 3.7 s vs. 146 s), but working up this path seemed promising, so I looked at how cd1 is called in the optimize function:

#gradient_function = cd1 for this call to optimize
for mini_batch in extract_mini_batch(training_data, mini_batch_size):
    gradient = gradient_function(model, mini_batch)
    ...

There’s nothing especially interesting about the gradient function line itself, but the for loop that feeds it uses a generator function. I tried swapping that out for a plain slicing of the training data (sketched after the timing below) and voilà:

%timeit model_rbm1 = dbn.optimize(model_rbm1_init, dbn.cd1, train, learning_rate=lr_pretrain, N_epochs=N_epochs_pretrain)
1 loops, best of 3: 3.18 s per loop
# compared to 18.6s per loop before
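
For reference, the temporary replacement looked something like this; it’s a reconstruction of the inner loop of optimize with the generator bypassed, not the exact code I ran:

# Inner loop of optimize with the generator bypassed (reconstruction, not the exact code).
for epoch in range(N_epochs):
    for i in xrange(0, data_len, mini_batch_size):
        mini_batch = {'data': training_data['data'][:, i:i+mini_batch_size]}
        gradient = gradient_function(model, mini_batch)
        for key in model.iterkeys():
            momentum_speed[key] = momentum*momentum_speed[key] + gradient[key]
            model[key] += momentum_speed[key]*learning_rate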

The generator function was the culprit, not because generators consume a lot of resources (they are actually very efficient), but because I had implemented mine poorly:

#before
def extract_mini_batch(data, mini_batch_size=100):
    '''Generates mini_batches from the data'''
    data_len = np.shape(data['data'])[1]
    return (dict([(key,
                   val[:,i:i+mini_batch_size] if len(np.shape(val))>2 and np.shape(val)[1]==data_len else val)
                   for key, val in data.iteritems()])
            for i in xrange(0, data_len, mini_batch_size))

#after
def extract_mini_batch(data, mini_batch_size=100):
    '''Generates mini_batches from the data'''
    data_len = np.shape(data['data'])[1]
    return (dict([(key, val[:,i:i+mini_batch_size])
                   for key, val in data.iteritems()])
            for i in xrange(0, data_len, mini_batch_size))

Looking at the before version, the real problem is the condition inside the ternary expression: for a 2-D array, len(np.shape(val)) > 2 is never true, so the generator fell through to the else branch and yielded the full data array instead of a mini-batch slice. Every call to cd1 was quietly working on the entire training set, which is why the run time blew up as the data grew. Fixing it meant giving up the ability to carry some extra variables along with the data, but they were extraneous and the speed-up is definitely worth it. Plus the code is more readable after the change.
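
A quick check in the shell shows why the before version never actually sliced anything (the shape here is just an example):

import numpy as np

val = np.random.rand(784, 700)     # a 2-D data array: 784 features x 700 samples
print len(np.shape(val)) > 2       # False, so the 'before' generator yields val unsliced
print np.shape(val[:, 0:100])      # (784, 100): the slice the 'after' version yields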

Results

Below are the results of the optimization. You can see the steep, worse-than-linear curve of the original Python code and the improved, linearly scaling Python code. Also note that the R code now scales worse than the Python code: still linear, but with a larger coefficient. A full run of the R code takes 30 minutes, while the new Python code takes only 6 minutes for all 42,000 data samples, down from 30 hours for the original Python version.

[Figure: Python vs. R deep learning timing comparison, training time vs. number of samples for the original Python, optimized Python, and R code]

While it may be possible to squeeze further improvements out of both the Python and the R code, those improvements would be hard to find, at least for me. I’m satisfied with how fast it runs, and I’m glad to know how easy it is to profile code in Python. I had been wondering for a while how Python and R compare in speed, and while there are plenty of benchmarks out there, testing it myself shows me the level of effort it takes to get good, fast code. I think Python wins on that count: between the straightforward numpy matrix syntax and the handy profiling commands, it only took me a couple of days to clean up the code and get it running fast. It took a lot more work to get the R code running as fast as it does.
